Problem Statement¶
Context¶
AllLife Bank is a US bank that has a growing customer base. The majority of these customers are liability customers (depositors) with varying sizes of deposits. The number of customers who are also borrowers (asset customers) is quite small, and the bank is interested in expanding this base rapidly to bring in more loan business and in the process, earn more through the interest on loans. In particular, the management wants to explore ways of converting its liability customers to personal loan customers (while retaining them as depositors).
A campaign that the bank ran last year for liability customers showed a healthy conversion rate of over 9% success. This has encouraged the retail marketing department to devise campaigns with better target marketing to increase the success ratio.
You as a Data scientist at AllLife bank have to build a model that will help the marketing department to identify the potential customers who have a higher probability of purchasing the loan.
Objective¶
To predict whether a liability customer will buy personal loans, to understand which customer attributes are most significant in driving purchases, and identify which segment of customers to target more.
Data Dictionary¶
ID: Customer IDAge: Customer’s age in completed yearsExperience: #years of professional experienceIncome: Annual income of the customer (in thousand dollars)ZIP Code: Home Address ZIP code.Family: the Family size of the customerCCAvg: Average spending on credit cards per month (in thousand dollars)Education: Education Level. 1: Undergrad; 2: Graduate;3: Advanced/ProfessionalMortgage: Value of house mortgage if any. (in thousand dollars)Personal_Loan: Did this customer accept the personal loan offered in the last campaign? (0: No, 1: Yes)Securities_Account: Does the customer have securities account with the bank? (0: No, 1: Yes)CD_Account: Does the customer have a certificate of deposit (CD) account with the bank? (0: No, 1: Yes)Online: Do customers use internet banking facilities? (0: No, 1: Yes)CreditCard: Does the customer use a credit card issued by any other Bank (excluding All life Bank)? (0: No, 1: Yes)
Importing necessary libraries¶
!jupyter nbconvert --to html "/content/drive/MyDrive/Colab Notebooks/Final copy of project2 - Linear Regression Logistic Regression and Decision Tree - Jerlin.ipynb"
This application is used to convert notebook files (*.ipynb)
to various other formats.
WARNING: THE COMMANDLINE INTERFACE MAY CHANGE IN FUTURE RELEASES.
Options
=======
The options below are convenience aliases to configurable class-options,
as listed in the "Equivalent to" description-line of the aliases.
To see all configurable class-options for some <cmd>, use:
<cmd> --help-all
--debug
set log level to logging.DEBUG (maximize logging output)
Equivalent to: [--Application.log_level=10]
--show-config
Show the application's configuration (human-readable format)
Equivalent to: [--Application.show_config=True]
--show-config-json
Show the application's configuration (json format)
Equivalent to: [--Application.show_config_json=True]
--generate-config
generate default config file
Equivalent to: [--JupyterApp.generate_config=True]
-y
Answer yes to any questions instead of prompting.
Equivalent to: [--JupyterApp.answer_yes=True]
--execute
Execute the notebook prior to export.
Equivalent to: [--ExecutePreprocessor.enabled=True]
--allow-errors
Continue notebook execution even if one of the cells throws an error and include the error message in the cell output (the default behaviour is to abort conversion). This flag is only relevant if '--execute' was specified, too.
Equivalent to: [--ExecutePreprocessor.allow_errors=True]
--stdin
read a single notebook file from stdin. Write the resulting notebook with default basename 'notebook.*'
Equivalent to: [--NbConvertApp.from_stdin=True]
--stdout
Write notebook output to stdout instead of files.
Equivalent to: [--NbConvertApp.writer_class=StdoutWriter]
--inplace
Run nbconvert in place, overwriting the existing notebook (only
relevant when converting to notebook format)
Equivalent to: [--NbConvertApp.use_output_suffix=False --NbConvertApp.export_format=notebook --FilesWriter.build_directory=]
--clear-output
Clear output of current file and save in place,
overwriting the existing notebook.
Equivalent to: [--NbConvertApp.use_output_suffix=False --NbConvertApp.export_format=notebook --FilesWriter.build_directory= --ClearOutputPreprocessor.enabled=True]
--coalesce-streams
Coalesce consecutive stdout and stderr outputs into one stream (within each cell).
Equivalent to: [--NbConvertApp.use_output_suffix=False --NbConvertApp.export_format=notebook --FilesWriter.build_directory= --CoalesceStreamsPreprocessor.enabled=True]
--no-prompt
Exclude input and output prompts from converted document.
Equivalent to: [--TemplateExporter.exclude_input_prompt=True --TemplateExporter.exclude_output_prompt=True]
--no-input
Exclude input cells and output prompts from converted document.
This mode is ideal for generating code-free reports.
Equivalent to: [--TemplateExporter.exclude_output_prompt=True --TemplateExporter.exclude_input=True --TemplateExporter.exclude_input_prompt=True]
--allow-chromium-download
Whether to allow downloading chromium if no suitable version is found on the system.
Equivalent to: [--WebPDFExporter.allow_chromium_download=True]
--disable-chromium-sandbox
Disable chromium security sandbox when converting to PDF..
Equivalent to: [--WebPDFExporter.disable_sandbox=True]
--show-input
Shows code input. This flag is only useful for dejavu users.
Equivalent to: [--TemplateExporter.exclude_input=False]
--embed-images
Embed the images as base64 dataurls in the output. This flag is only useful for the HTML/WebPDF/Slides exports.
Equivalent to: [--HTMLExporter.embed_images=True]
--sanitize-html
Whether the HTML in Markdown cells and cell outputs should be sanitized..
Equivalent to: [--HTMLExporter.sanitize_html=True]
--log-level=<Enum>
Set the log level by value or name.
Choices: any of [0, 10, 20, 30, 40, 50, 'DEBUG', 'INFO', 'WARN', 'ERROR', 'CRITICAL']
Default: 30
Equivalent to: [--Application.log_level]
--config=<Unicode>
Full path of a config file.
Default: ''
Equivalent to: [--JupyterApp.config_file]
--to=<Unicode>
The export format to be used, either one of the built-in formats
['asciidoc', 'custom', 'html', 'latex', 'markdown', 'notebook', 'pdf', 'python', 'qtpdf', 'qtpng', 'rst', 'script', 'slides', 'webpdf']
or a dotted object name that represents the import path for an
``Exporter`` class
Default: ''
Equivalent to: [--NbConvertApp.export_format]
--template=<Unicode>
Name of the template to use
Default: ''
Equivalent to: [--TemplateExporter.template_name]
--template-file=<Unicode>
Name of the template file to use
Default: None
Equivalent to: [--TemplateExporter.template_file]
--theme=<Unicode>
Template specific theme(e.g. the name of a JupyterLab CSS theme distributed
as prebuilt extension for the lab template)
Default: 'light'
Equivalent to: [--HTMLExporter.theme]
--sanitize_html=<Bool>
Whether the HTML in Markdown cells and cell outputs should be sanitized.This
should be set to True by nbviewer or similar tools.
Default: False
Equivalent to: [--HTMLExporter.sanitize_html]
--writer=<DottedObjectName>
Writer class used to write the
results of the conversion
Default: 'FilesWriter'
Equivalent to: [--NbConvertApp.writer_class]
--post=<DottedOrNone>
PostProcessor class used to write the
results of the conversion
Default: ''
Equivalent to: [--NbConvertApp.postprocessor_class]
--output=<Unicode>
Overwrite base name use for output files.
Supports pattern replacements '{notebook_name}'.
Default: '{notebook_name}'
Equivalent to: [--NbConvertApp.output_base]
--output-dir=<Unicode>
Directory to write output(s) to. Defaults
to output to the directory of each notebook. To recover
previous default behaviour (outputting to the current
working directory) use . as the flag value.
Default: ''
Equivalent to: [--FilesWriter.build_directory]
--reveal-prefix=<Unicode>
The URL prefix for reveal.js (version 3.x).
This defaults to the reveal CDN, but can be any url pointing to a copy
of reveal.js.
For speaker notes to work, this must be a relative path to a local
copy of reveal.js: e.g., "reveal.js".
If a relative path is given, it must be a subdirectory of the
current directory (from which the server is run).
See the usage documentation
(https://nbconvert.readthedocs.io/en/latest/usage.html#reveal-js-html-slideshow)
for more details.
Default: ''
Equivalent to: [--SlidesExporter.reveal_url_prefix]
--nbformat=<Enum>
The nbformat version to write.
Use this to downgrade notebooks.
Choices: any of [1, 2, 3, 4]
Default: 4
Equivalent to: [--NotebookExporter.nbformat_version]
Examples
--------
The simplest way to use nbconvert is
> jupyter nbconvert mynotebook.ipynb --to html
Options include ['asciidoc', 'custom', 'html', 'latex', 'markdown', 'notebook', 'pdf', 'python', 'qtpdf', 'qtpng', 'rst', 'script', 'slides', 'webpdf'].
> jupyter nbconvert --to latex mynotebook.ipynb
Both HTML and LaTeX support multiple output templates. LaTeX includes
'base', 'article' and 'report'. HTML includes 'basic', 'lab' and
'classic'. You can specify the flavor of the format used.
> jupyter nbconvert --to html --template lab mynotebook.ipynb
You can also pipe the output to stdout, rather than a file
> jupyter nbconvert mynotebook.ipynb --stdout
PDF is generated via latex
> jupyter nbconvert mynotebook.ipynb --to pdf
You can get (and serve) a Reveal.js-powered slideshow
> jupyter nbconvert myslides.ipynb --to slides --post serve
Multiple notebooks can be given at the command line in a couple of
different ways:
> jupyter nbconvert notebook*.ipynb
> jupyter nbconvert notebook1.ipynb notebook2.ipynb
or you can specify the notebooks list in a config file, containing::
c.NbConvertApp.notebooks = ["my_notebook.ipynb"]
> jupyter nbconvert --config mycfg.py
To see all available configurables, use `--help-all`.
[NbConvertApp] WARNING | pattern '/content/drive/MyDrive/Colab Notebooks/Final copy of project2 - Linear Regression Logistic Regression and Decision Tree - Jerlin.ipynb' matched no files
# Installing the libraries with the specified version.
!pip install numpy pandas matplotlib seaborn scikit-learn sklearn-pandas -q --user
[notice] A new release of pip is available: 24.0 -> 24.2 [notice] To update, run: C:\Users\paluv\AppData\Local\Microsoft\WindowsApps\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\python.exe -m pip install --upgrade pip
import sys
print(sys.version)
3.11.9 (tags/v3.11.9:de54cf5, Apr 2 2024, 10:12:12) [MSC v.1938 64 bit (AMD64)]
Note:
After running the above cell, kindly restart the notebook kernel (for Jupyter Notebook) or runtime (for Google Colab), write the relevant code for the project from the next cell, and run all cells sequentially from the next cell.
On executing the above line of code, you might see a warning regarding package dependencies. This error message can be ignored as the above code ensures that all necessary libraries and their dependencies are maintained to successfully execute the code in this notebook.
Loading the dataset¶
# Libraries to help with reading and manipulating data
import pandas as pd
import numpy as np
# libaries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
# Library to split data
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# Library to build model for prediction
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
# Library to get different metric scores
from sklearn.metrics import (
f1_score,
accuracy_score,
recall_score,
precision_score,
confusion_matrix,
)
# Library to build linear model for statistical analysis and prediction
from sklearn.linear_model import LinearRegression
#To build linear model for satistical analysis and prediction
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
#library required for structured python code
from sklearn_pandas import DataFrameMapper
# To ignore warnings
import warnings
warnings.filterwarnings("ignore")
%matplotlib inline
sns.set_style('darkgrid')
sns.set(color_codes=True)
#Remove the limit for number of displayed columns and rows
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
# library required for structured Python code
from sklearn_pandas import DataFrameMapper
#Library to split the data
from sklearn.model_selection import train_test_split
#Library to build model for prediction
from sklearn.tree import DecisionTreeClassifier
#Library to build decision tree models
from sklearn import tree
#Library to build Logistic regression model
from sklearn.linear_model import LogisticRegression
#Library to get different metric scores
from sklearn.metrics import (
f1_score,
accuracy_score,
recall_score,
precision_score,
confusion_matrix,
)
#Library to build linear model for statistical analysis and prediction
from sklearn.linear_model import LinearRegression
Data Overview¶
- Observations
- Sanity checks
# from google.colab import drive
# drive.mount('/content/drive')
# df = pd.read_csv('//content/drive/MyDrive/Project2/Loan_Modelling.csv')
df = pd.read_csv('Loan_Modelling.csv')
#creating a copy of the dataset
df_copy = df.copy()
Data Structure Analysis¶
#Viewing top observations from the dataset
df.head()
| ID | Age | Experience | Income | ZIPCode | Family | CCAvg | Education | Mortgage | Personal_Loan | Securities_Account | CD_Account | Online | CreditCard | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 25 | 1 | 49 | 91107 | 4 | 1.6 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 2 | 45 | 19 | 34 | 90089 | 3 | 1.5 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | 3 | 39 | 15 | 11 | 94720 | 1 | 1.0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 4 | 35 | 9 | 100 | 94112 | 1 | 2.7 | 2 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 5 | 35 | 8 | 45 | 91330 | 4 | 1.0 | 2 | 0 | 0 | 0 | 0 | 0 | 1 |
#Viewing bottom observations from the dataset
df.tail()
| ID | Age | Experience | Income | ZIPCode | Family | CCAvg | Education | Mortgage | Personal_Loan | Securities_Account | CD_Account | Online | CreditCard | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4995 | 4996 | 29 | 3 | 40 | 92697 | 1 | 1.9 | 3 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4996 | 4997 | 30 | 4 | 15 | 92037 | 4 | 0.4 | 1 | 85 | 0 | 0 | 0 | 1 | 0 |
| 4997 | 4998 | 63 | 39 | 24 | 93023 | 2 | 0.3 | 3 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4998 | 4999 | 65 | 40 | 49 | 90034 | 3 | 0.5 | 2 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4999 | 5000 | 28 | 4 | 83 | 92612 | 3 | 0.8 | 1 | 0 | 0 | 0 | 0 | 1 | 1 |
Observations:
- Age,experience and income indicate a wide range of numbers
- ID is set as the index
- Found mortgage has many 0's and 3 digit number indicating a skwed data
- Zipcode can be restructured into county for easy analysis
- Family and Education are under 5
- Personal_loan,securities_account,CD_Account,Online and creditc ard are all binary values
#Viewing the overall information of the dataset
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 5000 entries, 0 to 4999 Data columns (total 14 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 ID 5000 non-null int64 1 Age 5000 non-null int64 2 Experience 5000 non-null int64 3 Income 5000 non-null int64 4 ZIPCode 5000 non-null int64 5 Family 5000 non-null int64 6 CCAvg 5000 non-null float64 7 Education 5000 non-null int64 8 Mortgage 5000 non-null int64 9 Personal_Loan 5000 non-null int64 10 Securities_Account 5000 non-null int64 11 CD_Account 5000 non-null int64 12 Online 5000 non-null int64 13 CreditCard 5000 non-null int64 dtypes: float64(1), int64(13) memory usage: 547.0 KB
Observation¶
- There is no missing values listed
- Memory used is 547.0 KB
- There are 12 float integers and 1 float type columns.
df.isnull().sum()
ID 0 Age 0 Experience 0 Income 0 ZIPCode 0 Family 0 CCAvg 0 Education 0 Mortgage 0 Personal_Loan 0 Securities_Account 0 CD_Account 0 Online 0 CreditCard 0 dtype: int64
Observation¶
- There is no null present
df.nunique()
ID 5000 Age 45 Experience 47 Income 162 ZIPCode 467 Family 4 CCAvg 108 Education 3 Mortgage 347 Personal_Loan 2 Securities_Account 2 CD_Account 2 Online 2 CreditCard 2 dtype: int64
Observation
- It give us an idea of number of unique values in each column.
#Checking the shape of the dataset
df.shape
(5000, 14)
There are 5000 rows and 13 columns in this dataframe.
#checking the duplicates
df.duplicated().sum()
0
There is no duplicates in the data.
Initial Data processing¶
##library required to convert zipcode to county
!pip install uszipcode
import sys
print(sys.version)
Requirement already satisfied: uszipcode in c:\users\paluv\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (1.0.1) Requirement already satisfied: attrs in c:\users\paluv\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from uszipcode) (24.2.0) Requirement already satisfied: requests in c:\users\paluv\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from uszipcode) (2.32.3) Requirement already satisfied: pathlib-mate in c:\users\paluv\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from uszipcode) (1.3.2) Requirement already satisfied: atomicwrites in c:\users\paluv\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from uszipcode) (1.4.1) Requirement already satisfied: fuzzywuzzy in c:\users\paluv\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from uszipcode) (0.18.0) Requirement already satisfied: haversine>=2.5.0 in c:\users\paluv\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from uszipcode) (2.8.1) Requirement already satisfied: SQLAlchemy>=1.4.0 in c:\users\paluv\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from uszipcode) (2.0.35) Requirement already satisfied: sqlalchemy-mate>=1.4.28.3 in c:\users\paluv\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from uszipcode) (2.0.0.0) Requirement already satisfied: typing-extensions>=4.6.0 in c:\users\paluv\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from SQLAlchemy>=1.4.0->uszipcode) (4.12.2) Requirement already satisfied: greenlet!=0.4.17 in c:\users\paluv\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from SQLAlchemy>=1.4.0->uszipcode) (3.1.1) Requirement already satisfied: prettytable<4.0.0,>=3.0.0 in c:\users\paluv\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from sqlalchemy-mate>=1.4.28.3->uszipcode) (3.11.0) Requirement already satisfied: charset-normalizer<4,>=2 in c:\users\paluv\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from requests->uszipcode) (3.3.2) Requirement already satisfied: idna<4,>=2.5 in c:\users\paluv\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from requests->uszipcode) (3.10) Requirement already satisfied: urllib3<3,>=1.21.1 in c:\users\paluv\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from requests->uszipcode) (2.2.3) Requirement already satisfied: certifi>=2017.4.17 in c:\users\paluv\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from requests->uszipcode) (2024.8.30) Requirement already satisfied: wcwidth in c:\users\paluv\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from prettytable<4.0.0,>=3.0.0->sqlalchemy-mate>=1.4.28.3->uszipcode) (0.2.13) 3.11.9 (tags/v3.11.9:de54cf5, Apr 2 2024, 10:12:12) [MSC v.1938 64 bit (AMD64)]
[notice] A new release of pip is available: 24.0 -> 24.2 [notice] To update, run: C:\Users\paluv\AppData\Local\Microsoft\WindowsApps\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\python.exe -m pip install --upgrade pip
#user defined function
def zipcode_to_county(zipcode):
from uszipcode import SearchEngine
search = SearchEngine(simple_zipcode=True)
result = search.by_zipcode(zipcode)
return result.to_dict()['county']
#library required to convert zipcode to county
!pip install usZipcode==1.0.1 sqlalchemy_mate==2.0.0.0
from uszipcode import SearchEngine
import pandas as pd
def Zipcode_to_county(Zipc):
search = SearchEngine()
Zipcode = search.by_zipcode(Zipc) # Converting zipcode to zipcode object
if Zipcode: # Checking if Zipcode object exists
return Zipcode.county
else:
return None # Return None if zipcode is not found
#applying the user defined function and creating a new column 'County'
# Check if 'zip_code' column exists, if not, try variations or check your data
if 'zip_code' in df.columns:
df['county'] = df['zip_code'].apply(Zipcode_to_county)
else:
possible_column_names = ['zipcode', 'zip_code', 'ZIPCODE']
for col_name in possible_column_names:
if col_name in df.columns:
print(f"Found column '{col_name}', using it instead of 'zip_code'")
df['county'] = df[col_name].apply(Zipcode_to_county)
break
else:
print("No suitable column found in the DataFrame.")
Requirement already satisfied: usZipcode==1.0.1 in c:\users\paluv\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (1.0.1) Requirement already satisfied: sqlalchemy_mate==2.0.0.0 in c:\users\paluv\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (2.0.0.0) Requirement already satisfied: attrs in c:\users\paluv\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from usZipcode==1.0.1) (24.2.0) Requirement already satisfied: requests in c:\users\paluv\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from usZipcode==1.0.1) (2.32.3) Requirement already satisfied: pathlib-mate in c:\users\paluv\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from usZipcode==1.0.1) (1.3.2) Requirement already satisfied: atomicwrites in c:\users\paluv\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from usZipcode==1.0.1) (1.4.1) Requirement already satisfied: fuzzywuzzy in c:\users\paluv\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from usZipcode==1.0.1) (0.18.0) Requirement already satisfied: haversine>=2.5.0 in c:\users\paluv\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from usZipcode==1.0.1) (2.8.1) Requirement already satisfied: SQLAlchemy>=1.4.0 in c:\users\paluv\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from usZipcode==1.0.1) (2.0.35) Requirement already satisfied: prettytable<4.0.0,>=3.0.0 in c:\users\paluv\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from sqlalchemy_mate==2.0.0.0) (3.11.0) Requirement already satisfied: wcwidth in c:\users\paluv\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from prettytable<4.0.0,>=3.0.0->sqlalchemy_mate==2.0.0.0) (0.2.13) Requirement already satisfied: typing-extensions>=4.6.0 in c:\users\paluv\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from SQLAlchemy>=1.4.0->usZipcode==1.0.1) (4.12.2) Requirement already satisfied: greenlet!=0.4.17 in c:\users\paluv\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from SQLAlchemy>=1.4.0->usZipcode==1.0.1) (3.1.1) Requirement already satisfied: charset-normalizer<4,>=2 in c:\users\paluv\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from requests->usZipcode==1.0.1) (3.3.2) Requirement already satisfied: idna<4,>=2.5 in c:\users\paluv\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from requests->usZipcode==1.0.1) (3.10) Requirement already satisfied: urllib3<3,>=1.21.1 in c:\users\paluv\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from requests->usZipcode==1.0.1) (2.2.3) Requirement already satisfied: certifi>=2017.4.17 in c:\users\paluv\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from requests->usZipcode==1.0.1) (2024.8.30)
[notice] A new release of pip is available: 24.0 -> 24.2 [notice] To update, run: C:\Users\paluv\AppData\Local\Microsoft\WindowsApps\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\python.exe -m pip install --upgrade pip
No suitable column found in the DataFrame.
#Applying the user defined function and creating a new column 'County'
!pip install uszipcode
df['County'] = df['ZIPCode'].apply
Requirement already satisfied: uszipcode in c:\users\paluv\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (1.0.1) Requirement already satisfied: attrs in c:\users\paluv\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from uszipcode) (24.2.0) Requirement already satisfied: requests in c:\users\paluv\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from uszipcode) (2.32.3) Requirement already satisfied: pathlib-mate in c:\users\paluv\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from uszipcode) (1.3.2) Requirement already satisfied: atomicwrites in c:\users\paluv\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from uszipcode) (1.4.1) Requirement already satisfied: fuzzywuzzy in c:\users\paluv\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from uszipcode) (0.18.0) Requirement already satisfied: haversine>=2.5.0 in c:\users\paluv\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from uszipcode) (2.8.1) Requirement already satisfied: SQLAlchemy>=1.4.0 in c:\users\paluv\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from uszipcode) (2.0.35) Requirement already satisfied: sqlalchemy-mate>=1.4.28.3 in c:\users\paluv\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from uszipcode) (2.0.0.0) Requirement already satisfied: typing-extensions>=4.6.0 in c:\users\paluv\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from SQLAlchemy>=1.4.0->uszipcode) (4.12.2) Requirement already satisfied: greenlet!=0.4.17 in c:\users\paluv\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from SQLAlchemy>=1.4.0->uszipcode) (3.1.1) Requirement already satisfied: prettytable<4.0.0,>=3.0.0 in c:\users\paluv\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from sqlalchemy-mate>=1.4.28.3->uszipcode) (3.11.0) Requirement already satisfied: charset-normalizer<4,>=2 in c:\users\paluv\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from requests->uszipcode) (3.3.2) Requirement already satisfied: idna<4,>=2.5 in c:\users\paluv\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from requests->uszipcode) (3.10) Requirement already satisfied: urllib3<3,>=1.21.1 in c:\users\paluv\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from requests->uszipcode) (2.2.3) Requirement already satisfied: certifi>=2017.4.17 in c:\users\paluv\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from requests->uszipcode) (2024.8.30) Requirement already satisfied: wcwidth in c:\users\paluv\appdata\local\packages\pythonsoftwarefoundation.python.3.11_qbz5n2kfra8p0\localcache\local-packages\python311\site-packages (from prettytable<4.0.0,>=3.0.0->sqlalchemy-mate>=1.4.28.3->uszipcode) (0.2.13)
[notice] A new release of pip is available: 24.0 -> 24.2 [notice] To update, run: C:\Users\paluv\AppData\Local\Microsoft\WindowsApps\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\python.exe -m pip install --upgrade pip
search = SearchEngine()
df['County'] = df['ZIPCode'].apply(search.by_zipcode)
df['County'] = df['County'].apply(lambda x: x.to_dict()['county'] if x else None)
df.head(10)
| ID | Age | Experience | Income | ZIPCode | Family | CCAvg | Education | Mortgage | Personal_Loan | Securities_Account | CD_Account | Online | CreditCard | County | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 25 | 1 | 49 | 91107 | 4 | 1.6 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | Los Angeles County |
| 1 | 2 | 45 | 19 | 34 | 90089 | 3 | 1.5 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | Los Angeles County |
| 2 | 3 | 39 | 15 | 11 | 94720 | 1 | 1.0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | Alameda County |
| 3 | 4 | 35 | 9 | 100 | 94112 | 1 | 2.7 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | San Francisco County |
| 4 | 5 | 35 | 8 | 45 | 91330 | 4 | 1.0 | 2 | 0 | 0 | 0 | 0 | 0 | 1 | Los Angeles County |
| 5 | 6 | 37 | 13 | 29 | 92121 | 4 | 0.4 | 2 | 155 | 0 | 0 | 0 | 1 | 0 | San Diego County |
| 6 | 7 | 53 | 27 | 72 | 91711 | 2 | 1.5 | 2 | 0 | 0 | 0 | 0 | 1 | 0 | Los Angeles County |
| 7 | 8 | 50 | 24 | 22 | 93943 | 1 | 0.3 | 3 | 0 | 0 | 0 | 0 | 0 | 1 | Monterey County |
| 8 | 9 | 35 | 10 | 81 | 90089 | 3 | 0.6 | 2 | 104 | 0 | 0 | 0 | 1 | 0 | Los Angeles County |
| 9 | 10 | 34 | 9 | 180 | 93023 | 1 | 8.9 | 3 | 0 | 1 | 0 | 0 | 0 | 0 | Ventura County |
df.nunique()
ID 5000 Age 45 Experience 47 Income 162 ZIPCode 467 Family 4 CCAvg 108 Education 3 Mortgage 347 Personal_Loan 2 Securities_Account 2 CD_Account 2 Online 2 CreditCard 2 County 38 dtype: int64
Observation¶
467 unique values in ZIPCode decreased to 38 unique values in 'County'
#Making a list of columns
df.columns
Index(['ID', 'Age', 'Experience', 'Income', 'ZIPCode', 'Family', 'CCAvg',
'Education', 'Mortgage', 'Personal_Loan', 'Securities_Account',
'CD_Account', 'Online', 'CreditCard', 'County'],
dtype='object')
df.describe(include="all")
| ID | Age | Experience | Income | ZIPCode | Family | CCAvg | Education | Mortgage | Personal_Loan | Securities_Account | CD_Account | Online | CreditCard | County | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 5000.000000 | 5000.000000 | 5000.000000 | 5000.000000 | 5000.000000 | 5000.000000 | 5000.000000 | 5000.000000 | 5000.000000 | 5000.000000 | 5000.000000 | 5000.00000 | 5000.000000 | 5000.000000 | 4966 |
| unique | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 38 |
| top | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Los Angeles County |
| freq | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1095 |
| mean | 2500.500000 | 45.338400 | 20.104600 | 73.774200 | 93169.257000 | 2.396400 | 1.937938 | 1.881000 | 56.498800 | 0.096000 | 0.104400 | 0.06040 | 0.596800 | 0.294000 | NaN |
| std | 1443.520003 | 11.463166 | 11.467954 | 46.033729 | 1759.455086 | 1.147663 | 1.747659 | 0.839869 | 101.713802 | 0.294621 | 0.305809 | 0.23825 | 0.490589 | 0.455637 | NaN |
| min | 1.000000 | 23.000000 | -3.000000 | 8.000000 | 90005.000000 | 1.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | NaN |
| 25% | 1250.750000 | 35.000000 | 10.000000 | 39.000000 | 91911.000000 | 1.000000 | 0.700000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | NaN |
| 50% | 2500.500000 | 45.000000 | 20.000000 | 64.000000 | 93437.000000 | 2.000000 | 1.500000 | 2.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 1.000000 | 0.000000 | NaN |
| 75% | 3750.250000 | 55.000000 | 30.000000 | 98.000000 | 94608.000000 | 3.000000 | 2.500000 | 3.000000 | 101.000000 | 0.000000 | 0.000000 | 0.00000 | 1.000000 | 1.000000 | NaN |
| max | 5000.000000 | 67.000000 | 43.000000 | 224.000000 | 96651.000000 | 4.000000 | 10.000000 | 3.000000 | 635.000000 | 1.000000 | 1.000000 | 1.00000 | 1.000000 | 1.000000 | NaN |
# Printing top 5 unique values in each column
for i in df.columns:
print("\nUnique values in", i, "are :")
print(df[i].value_counts().head())
print("\n")
print("-" * 40)
Unique values in ID are : ID 1 1 3331 1 3338 1 3337 1 3336 1 Name: count, dtype: int64 ---------------------------------------- Unique values in Age are : Age 35 151 43 149 52 145 54 143 58 143 Name: count, dtype: int64 ---------------------------------------- Unique values in Experience are : Experience 32 154 20 148 9 147 5 146 23 144 Name: count, dtype: int64 ---------------------------------------- Unique values in Income are : Income 44 85 38 84 81 83 41 82 39 81 Name: count, dtype: int64 ---------------------------------------- Unique values in ZIPCode are : ZIPCode 94720 169 94305 127 95616 116 90095 71 93106 57 Name: count, dtype: int64 ---------------------------------------- Unique values in Family are : Family 1 1472 2 1296 4 1222 3 1010 Name: count, dtype: int64 ---------------------------------------- Unique values in CCAvg are : CCAvg 0.3 241 1.0 231 0.2 204 2.0 188 0.8 187 Name: count, dtype: int64 ---------------------------------------- Unique values in Education are : Education 1 2096 3 1501 2 1403 Name: count, dtype: int64 ---------------------------------------- Unique values in Mortgage are : Mortgage 0 3462 98 17 119 16 89 16 91 16 Name: count, dtype: int64 ---------------------------------------- Unique values in Personal_Loan are : Personal_Loan 0 4520 1 480 Name: count, dtype: int64 ---------------------------------------- Unique values in Securities_Account are : Securities_Account 0 4478 1 522 Name: count, dtype: int64 ---------------------------------------- Unique values in CD_Account are : CD_Account 0 4698 1 302 Name: count, dtype: int64 ---------------------------------------- Unique values in Online are : Online 1 2984 0 2016 Name: count, dtype: int64 ---------------------------------------- Unique values in CreditCard are : CreditCard 0 3530 1 1470 Name: count, dtype: int64 ---------------------------------------- Unique values in County are : County Los Angeles County 1095 San Diego County 568 Santa Clara County 563 Alameda County 500 Orange County 339 Name: count, dtype: int64 ----------------------------------------
Observation¶
- 0's out-number 1 in Mortgage,Personal_Loan, Securities_Account, CD_Account and CreditCard coulmns.
- Los Angeles County has the highest number of observations of 1095. There is handful of counties with under 50 observations.
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 5000 entries, 0 to 4999 Data columns (total 15 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 ID 5000 non-null int64 1 Age 5000 non-null int64 2 Experience 5000 non-null int64 3 Income 5000 non-null int64 4 ZIPCode 5000 non-null int64 5 Family 5000 non-null int64 6 CCAvg 5000 non-null float64 7 Education 5000 non-null int64 8 Mortgage 5000 non-null int64 9 Personal_Loan 5000 non-null int64 10 Securities_Account 5000 non-null int64 11 CD_Account 5000 non-null int64 12 Online 5000 non-null int64 13 CreditCard 5000 non-null int64 14 County 4966 non-null object dtypes: float64(1), int64(13), object(1) memory usage: 586.1+ KB
Datatypes of CCAvg is float and county is object.Rest of them are integers.
#Printing bottom 5 unique values in each column
for i in df.columns:
print("\nUnique values in", i, "are :")
print(df[i].value_counts().tail())
print("\n")
print("-" * 40)
Unique values in ID are : ID 1667 1 1666 1 1665 1 1664 1 5000 1 Name: count, dtype: int64 ---------------------------------------- Unique values in Age are : Age 25 53 24 28 66 24 67 12 23 12 Name: count, dtype: int64 ---------------------------------------- Unique values in Experience are : Experience -1 33 -2 15 42 8 -3 4 43 3 Name: count, dtype: int64 ---------------------------------------- Unique values in Income are : Income 202 2 203 2 189 2 224 1 218 1 Name: count, dtype: int64 ---------------------------------------- Unique values in ZIPCode are : ZIPCode 96145 1 94087 1 91024 1 93077 1 94598 1 Name: count, dtype: int64 ---------------------------------------- Unique values in Family are : Family 1 1472 2 1296 4 1222 3 1010 Name: count, dtype: int64 ---------------------------------------- Unique values in CCAvg are : CCAvg 3.25 1 3.67 1 4.67 1 8.90 1 2.75 1 Name: count, dtype: int64 ---------------------------------------- Unique values in Education are : Education 1 2096 3 1501 2 1403 Name: count, dtype: int64 ---------------------------------------- Unique values in Mortgage are : Mortgage 547 1 458 1 505 1 361 1 541 1 Name: count, dtype: int64 ---------------------------------------- Unique values in Personal_Loan are : Personal_Loan 0 4520 1 480 Name: count, dtype: int64 ---------------------------------------- Unique values in Securities_Account are : Securities_Account 0 4478 1 522 Name: count, dtype: int64 ---------------------------------------- Unique values in CD_Account are : CD_Account 0 4698 1 302 Name: count, dtype: int64 ---------------------------------------- Unique values in Online are : Online 1 2984 0 2016 Name: count, dtype: int64 ---------------------------------------- Unique values in CreditCard are : CreditCard 0 3530 1 1470 Name: count, dtype: int64 ---------------------------------------- Unique values in County are : County Merced County 4 Trinity County 4 Lake County 4 Imperial County 3 Napa County 3 Name: count, dtype: int64 ----------------------------------------
Observation
- There are negative values in 'Experience' that needs to be fixed
- There are counties under 5 observations as opposed to Los Angeles County with 1095 observations.
df[df['Experience'] < 0].value_counts().sum()
51
- Overall 51 rows with negative experience
df[df['Experience'] < 0].sort_values(by='Experience', ascending=True)
| ID | Age | Experience | Income | ZIPCode | Family | CCAvg | Education | Mortgage | Personal_Loan | Securities_Account | CD_Account | Online | CreditCard | County | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4514 | 4515 | 24 | -3 | 41 | 91768 | 4 | 1.00 | 3 | 0 | 0 | 0 | 0 | 1 | 0 | Los Angeles County |
| 2618 | 2619 | 23 | -3 | 55 | 92704 | 3 | 2.40 | 2 | 145 | 0 | 0 | 0 | 1 | 0 | Orange County |
| 4285 | 4286 | 23 | -3 | 149 | 93555 | 2 | 7.20 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | Kern County |
| 3626 | 3627 | 24 | -3 | 28 | 90089 | 4 | 1.00 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | Los Angeles County |
| 2717 | 2718 | 23 | -2 | 45 | 95422 | 4 | 0.60 | 2 | 0 | 0 | 0 | 0 | 1 | 1 | Lake County |
| 2876 | 2877 | 24 | -2 | 80 | 91107 | 2 | 1.60 | 3 | 238 | 0 | 0 | 0 | 0 | 0 | Los Angeles County |
| 3130 | 3131 | 23 | -2 | 82 | 92152 | 2 | 1.80 | 2 | 0 | 0 | 1 | 0 | 0 | 1 | San Diego County |
| 2466 | 2467 | 24 | -2 | 80 | 94105 | 2 | 1.60 | 3 | 0 | 0 | 0 | 0 | 1 | 0 | San Francisco County |
| 3796 | 3797 | 24 | -2 | 50 | 94920 | 3 | 2.40 | 2 | 0 | 0 | 1 | 0 | 0 | 0 | Marin County |
| 2962 | 2963 | 23 | -2 | 81 | 91711 | 2 | 1.80 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | Los Angeles County |
| 793 | 794 | 24 | -2 | 150 | 94720 | 2 | 2.00 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | Alameda County |
| 3887 | 3888 | 24 | -2 | 118 | 92634 | 2 | 7.20 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | None |
| 889 | 890 | 24 | -2 | 82 | 91103 | 2 | 1.60 | 3 | 0 | 0 | 0 | 0 | 1 | 1 | Los Angeles County |
| 597 | 598 | 24 | -2 | 125 | 92835 | 2 | 7.20 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | Orange County |
| 315 | 316 | 24 | -2 | 51 | 90630 | 3 | 0.30 | 3 | 0 | 0 | 0 | 0 | 1 | 0 | Orange County |
| 451 | 452 | 28 | -2 | 48 | 94132 | 2 | 1.75 | 3 | 89 | 0 | 0 | 0 | 1 | 0 | San Francisco County |
| 4116 | 4117 | 24 | -2 | 135 | 90065 | 2 | 7.20 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | Los Angeles County |
| 4481 | 4482 | 25 | -2 | 35 | 95045 | 4 | 1.00 | 3 | 0 | 0 | 0 | 0 | 1 | 0 | San Benito County |
| 4411 | 4412 | 23 | -2 | 75 | 90291 | 2 | 1.80 | 2 | 0 | 0 | 0 | 0 | 1 | 1 | Los Angeles County |
| 4015 | 4016 | 25 | -1 | 139 | 93106 | 2 | 2.00 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | Santa Barbara County |
| 3157 | 3158 | 23 | -1 | 13 | 94720 | 4 | 1.00 | 1 | 84 | 0 | 0 | 0 | 1 | 0 | Alameda County |
| 3279 | 3280 | 26 | -1 | 44 | 94901 | 1 | 2.00 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | Marin County |
| 3284 | 3285 | 25 | -1 | 101 | 95819 | 4 | 2.10 | 3 | 0 | 0 | 0 | 0 | 0 | 1 | Sacramento County |
| 3292 | 3293 | 25 | -1 | 13 | 95616 | 4 | 0.40 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | Yolo County |
| 3394 | 3395 | 25 | -1 | 113 | 90089 | 4 | 2.10 | 3 | 0 | 0 | 0 | 0 | 1 | 0 | Los Angeles County |
| 3425 | 3426 | 23 | -1 | 12 | 91605 | 4 | 1.00 | 1 | 90 | 0 | 0 | 0 | 1 | 0 | Los Angeles County |
| 4088 | 4089 | 29 | -1 | 71 | 94801 | 2 | 1.75 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | Contra Costa County |
| 3824 | 3825 | 23 | -1 | 12 | 95064 | 4 | 1.00 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | Santa Cruz County |
| 3946 | 3947 | 25 | -1 | 40 | 93117 | 3 | 2.40 | 2 | 0 | 0 | 0 | 0 | 1 | 0 | Santa Barbara County |
| 3076 | 3077 | 29 | -1 | 62 | 92672 | 2 | 1.75 | 3 | 0 | 0 | 0 | 0 | 0 | 1 | Orange County |
| 89 | 90 | 25 | -1 | 113 | 94303 | 4 | 2.30 | 3 | 0 | 0 | 0 | 0 | 0 | 1 | San Mateo County |
| 2848 | 2849 | 24 | -1 | 78 | 94720 | 2 | 1.80 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | Alameda County |
| 226 | 227 | 24 | -1 | 39 | 94085 | 2 | 1.70 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | Santa Clara County |
| 524 | 525 | 24 | -1 | 75 | 93014 | 4 | 0.20 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | Santa Barbara County |
| 536 | 537 | 25 | -1 | 43 | 92173 | 3 | 2.40 | 2 | 176 | 0 | 0 | 0 | 1 | 0 | San Diego County |
| 540 | 541 | 25 | -1 | 109 | 94010 | 4 | 2.30 | 3 | 314 | 0 | 0 | 0 | 1 | 0 | San Mateo County |
| 576 | 577 | 25 | -1 | 48 | 92870 | 3 | 0.30 | 3 | 0 | 0 | 0 | 0 | 0 | 1 | Orange County |
| 583 | 584 | 24 | -1 | 38 | 95045 | 2 | 1.70 | 2 | 0 | 0 | 0 | 0 | 1 | 0 | San Benito County |
| 649 | 650 | 25 | -1 | 82 | 92677 | 4 | 2.10 | 3 | 0 | 0 | 0 | 0 | 1 | 0 | Orange County |
| 670 | 671 | 23 | -1 | 61 | 92374 | 4 | 2.60 | 1 | 239 | 0 | 0 | 0 | 1 | 0 | San Bernardino County |
| 2980 | 2981 | 25 | -1 | 53 | 94305 | 3 | 2.40 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | Santa Clara County |
| 686 | 687 | 24 | -1 | 38 | 92612 | 4 | 0.60 | 2 | 0 | 0 | 0 | 0 | 1 | 0 | Orange County |
| 1173 | 1174 | 24 | -1 | 35 | 94305 | 2 | 1.70 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | Santa Clara County |
| 1428 | 1429 | 25 | -1 | 21 | 94583 | 4 | 0.40 | 1 | 90 | 0 | 0 | 0 | 1 | 0 | Contra Costa County |
| 1522 | 1523 | 25 | -1 | 101 | 94720 | 4 | 2.30 | 3 | 256 | 0 | 0 | 0 | 0 | 1 | Alameda County |
| 1905 | 1906 | 25 | -1 | 112 | 92507 | 2 | 2.00 | 1 | 241 | 0 | 0 | 0 | 1 | 0 | Riverside County |
| 2102 | 2103 | 25 | -1 | 81 | 92647 | 2 | 1.60 | 3 | 0 | 0 | 0 | 0 | 1 | 1 | Orange County |
| 2430 | 2431 | 23 | -1 | 73 | 92120 | 4 | 2.60 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | San Diego County |
| 2545 | 2546 | 25 | -1 | 39 | 94720 | 3 | 2.40 | 2 | 0 | 0 | 0 | 0 | 1 | 0 | Alameda County |
| 4582 | 4583 | 25 | -1 | 69 | 92691 | 3 | 0.30 | 3 | 0 | 0 | 0 | 0 | 1 | 0 | Orange County |
| 909 | 910 | 23 | -1 | 149 | 91709 | 1 | 6.33 | 1 | 305 | 0 | 0 | 0 | 0 | 1 | San Bernardino County |
| 4957 | 4958 | 29 | -1 | 50 | 95842 | 2 | 1.75 | 3 | 0 | 0 | 0 | 0 | 0 | 1 | Sacramento County |
Observation
- There are 51 rows with negative values for Experience
- Let's look at the corresponding age, education and income values for any pattern.
df[df['Experience']<0]["Age"].value_counts()
Age 25 18 24 17 23 12 29 3 28 1 26 1 Name: count, dtype: int64
df[df["Experience"]<0]["Education"].value_counts()
Education 3 20 2 16 1 16 Name: count, dtype: int64
df[df["Experience"]<0]["Income"].value_counts()
Income 82 3 113 2 149 2 12 2 80 2 39 2 81 2 101 2 50 2 35 2 13 2 38 2 75 2 48 2 62 1 40 1 139 1 71 1 28 1 135 1 41 1 44 1 118 1 73 1 53 1 78 1 45 1 55 1 112 1 21 1 150 1 61 1 125 1 109 1 43 1 51 1 69 1 Name: count, dtype: int64
- Considering the corresponding Age,education,and incoe values, there does not seem to be any pattern and hence the negative value in expeience could most probably be a data entry error. Therefore will convert these values to absolute values instead of treating them as missing values.
#user defined function to convert negative values to absolute values
def convert_to_absolute(value):
if value < 0:
return abs(value)
else:
return value
df['Experience'] = df['Experience'].apply(convert_to_absolute)
df["Experience"].unique()
array([ 1, 19, 15, 9, 8, 13, 27, 24, 10, 39, 5, 23, 32, 41, 30, 14, 18,
21, 28, 31, 11, 16, 20, 35, 6, 25, 7, 12, 26, 37, 17, 2, 36, 29,
3, 22, 34, 0, 38, 40, 33, 4, 42, 43], dtype=int64)
There is no more negative values in Experience.
#Checking the statistical summary of the dataset
df.describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| ID | 5000.0 | 2500.500000 | 1443.520003 | 1.0 | 1250.75 | 2500.5 | 3750.25 | 5000.0 |
| Age | 5000.0 | 45.338400 | 11.463166 | 23.0 | 35.00 | 45.0 | 55.00 | 67.0 |
| Experience | 5000.0 | 20.134600 | 11.415189 | 0.0 | 10.00 | 20.0 | 30.00 | 43.0 |
| Income | 5000.0 | 73.774200 | 46.033729 | 8.0 | 39.00 | 64.0 | 98.00 | 224.0 |
| ZIPCode | 5000.0 | 93169.257000 | 1759.455086 | 90005.0 | 91911.00 | 93437.0 | 94608.00 | 96651.0 |
| Family | 5000.0 | 2.396400 | 1.147663 | 1.0 | 1.00 | 2.0 | 3.00 | 4.0 |
| CCAvg | 5000.0 | 1.937938 | 1.747659 | 0.0 | 0.70 | 1.5 | 2.50 | 10.0 |
| Education | 5000.0 | 1.881000 | 0.839869 | 1.0 | 1.00 | 2.0 | 3.00 | 3.0 |
| Mortgage | 5000.0 | 56.498800 | 101.713802 | 0.0 | 0.00 | 0.0 | 101.00 | 635.0 |
| Personal_Loan | 5000.0 | 0.096000 | 0.294621 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| Securities_Account | 5000.0 | 0.104400 | 0.305809 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| CD_Account | 5000.0 | 0.060400 | 0.238250 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| Online | 5000.0 | 0.596800 | 0.490589 | 0.0 | 0.00 | 1.0 | 1.00 | 1.0 |
| CreditCard | 5000.0 | 0.294000 | 0.455637 | 0.0 | 0.00 | 0.0 | 1.00 | 1.0 |
- There are 5000 observations present in all
- The minimum and maximum values indicate a widely spread in Age,Experience,Income and Mortgage *. Differences between mean and medium values indicate skewness in the data.
Exploratory Data Analysis.¶
- EDA is an important part of any project involving data.
- It is important to investigate and understand the data better before building a model with it.
- A few questions have been mentioned below which will help you approach the analysis in the right manner and generate insights from the data.
- A thorough analysis of the data, in addition to the questions mentioned below, should be done.
Questions:
- What is the distribution of mortgage attribute? Are there any noticeable patterns or outliers in the distribution?
- How many customers have credit cards?
- What are the attributes that have a strong correlation with the target attribute (personal loan)?
- How does a customer's interest in purchasing a loan vary with their age?
- How does a customer's interest in purchasing a loan vary with their education?
Univariate Analysis¶
#Defining a method to print the value counts in the plots
def val_count(plot, feature):
for p in ax.patches:
ax.annotate( # annotate the count
p.get_height(),
(
p.get_x() + p.get_width() / 2.0, # calculating the value counts
p.get_height(),
),
ha="center", # horizontal alignment of the text
va="center", # vertical alignment
xytext=(0, 5), # text position
textcoords="offset points", # offset in points from the xy value
)
#Defining a method to print the percentage of data points in the plot
def perc_on_bar(plot, feature):
"""
plot
feature: categorical feature
the function won't work if a column is passed in hue parameter
"""
total = len(feature) # length of the column
for p in ax.patches:
percentage = "{:.1f}%".format(
100 * p.get_height() / total
) # percentage of each class of the category
x = p.get_x() + p.get_width() / 2 - 0.05 # width of the plot
y = p.get_y() + p.get_height() # hieght of the plot
ax.annotate(percentage, (x, y), size=12) # annotate the percantage
plt.show() # show the plot
# Defining a method to plot histogram and boxplot combined in a single plot
def histogram_boxplot(feature, figsize=(15, 10), bins=None):
"""Boxplot and histogram combined
feature: 1-d feature array
figsize: size of fig (default (9,8))
bins: number of bins (default None / auto)
"""
f2, (ax_box2, ax_hist2) = plt.subplots(
nrows=2, # Number of rows of the subplot grid= 2
sharex=True, # x-axis will be shared among all subplots
gridspec_kw={"height_ratios": (0.25, 0.75)},
figsize=figsize,
) # creating the 2 subplots
sns.boxplot(
feature, ax=ax_box2, showmeans=True, color="violet"
) # boxplot will be created and a star will indicate the mean value of the column
sns.distplot(
feature, kde=F, ax=ax_hist2, bins=bins, color="orange"
) if bins else sns.distplot(
feature, kde=False, ax=ax_hist2, color="green"
) # For histogram
ax_hist2.axvline(
np.mean(feature), color="purple", linestyle="--"
) # Add mean to the histogram
ax_hist2.axvline(
np.median(feature), color="black", linestyle="-"
) # Add median to the histogram
#setting the color palette for the plots
sns.set_palette("husl")
1. Age¶
import matplotlib.pyplot as plt
import seaborn as sns
# plot Histogram and boxplot combined in a single
def histogram_boxplot(data, feature):
fig, (ax_box, ax_hist) = plt.subplots(2, sharex=True, gridspec_kw={"height_ratios": (.25, .75)})
sns.boxplot(data=data, x=feature, ax=ax_box)
sns.histplot(data=data, x=feature, ax=ax_hist)
ax_box.set(yticks=[])
sns.despine(ax=ax_hist)
sns.despine(ax=ax_box, left=True)
histogram_boxplot(df, "Age")
#Add mean to the histogram
plt.axvline(df["Age"].mean(), color="k", linestyle="--")
#Add median to the histogram
plt.axvline(df["Age"].median(), color="r", linestyle="-")
<matplotlib.lines.Line2D at 0x1b2161cccd0>
**Observation
- The uniform distribution with five peak values are present in the histogram.
- Mean and Median are very close to each other
- The boxplot indicates no outliers.
2. Experience¶
histogram_boxplot(df, "Experience")
#Add mean to the histogram
plt.axvline(df["Experience"].mean(), color="k", linestyle="--")
#Add median to the histogram
plt.axvline(df["Experience"].median(), color="r", linestyle="-")
<matplotlib.lines.Line2D at 0x1b216473010>
- Mean and median are same
- Boxplot show no outliers
- The histrogram shows a uniform distribution with five peak values as like age
3. Income¶
histogram_boxplot(df, "Income")
#Add mean to the histogram
plt.axvline(df["Income"].mean(), color="k", linestyle="--")
#Add median to the histogram
plt.axvline(df["Income"].median(), color="r", linestyle="-")
<matplotlib.lines.Line2D at 0x1b218804190>
The distributon in the boxplot is right skewed. The outliers spotted above the upper whisker in the boxplot. The mean and medium are same.
4. Family¶
plt.figure(figsize=(10, 6))
ax = sns.countplot(data=df, x="Family")
perc_on_bar(ax, df["Family"])
- Maximum number of customers are with a family size of 1.The next family sizes which have more percentage are 2 and 4
- The least family size in the graph is 3(20.2%)
5. Credit Card Average¶
import matplotlib.pyplot as plt
import seaborn as sns
def histogram_boxplot(data, feature):
fig, (ax_box, ax_hist) = plt.subplots(2, sharex=True, gridspec_kw={"height_ratios": (.25, .75)})
sns.boxplot(data=data, x=feature, ax=ax_box)
sns.histplot(data=data, x=feature, ax=ax_hist)
ax_box.set(yticks=[])
sns.despine(ax=ax_hist)
sns.despine(ax=ax_box, left=True)
# plot Histogram and boxplot combined in a single
histogram_boxplot(df, "CCAvg")
#Add mean to the histogram
plt.axvline(df["CCAvg"].mean(), color="k", linestyle="--")
#Add median to the histogram
plt.axvline(df["CCAvg"].median(), color="r", linestyle="-")
<matplotlib.lines.Line2D at 0x1b2188db790>
- Right skewed data
- Outliers above the upper whisker
6. Education¶
plt.figure(figsize=(10, 6))
ax = sns.countplot(data=df, x="Education")
perc_on_bar(ax, df["Education"])
Highest percentage of customer are under graduates
About 28% of the customers are graduates
Above 305of them are advanced/professionals
The distribution suggests that 'Education' can be treated as category.
7.Mortgage¶
Mortgage == df["Mortgage"]
Mortgage.value_counts()
print(Mortgage.value_counts())
Mortgage.unique()
histogram_boxplot(df, "Mortgage")
--------------------------------------------------------------------------- NameError Traceback (most recent call last) Cell In[53], line 1 ----> 1 Mortgage == df["Mortgage"] 2 Mortgage.value_counts() 3 print(Mortgage.value_counts()) NameError: name 'Mortgage' is not defined
- Highly skewed towards '0'
- Maximum number of customers do not have mortgage
- Outliers spotted above the upper whisker
- Most do further analysis to determine if 'Mortgage'can also be treated as category.
8. Personal Loan¶
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd # Import pandas for data manipulation
plt.figure(figsize=(10, 6))
ax = sns.countplot(data=df, x="Personal_Loan")
perc_on_bar(ax, df["Personal_Loan"])
1.It is found that around 90% of the customers did not accept the personal loan offered in the previous campaign. 2.There is only minimum amount of 9.6% of customers have accepted the personal loan.
9. Securities¶
plt.figure(figsize=(10, 6))
ax = sns.countplot(data=df, x="Securities_Account")
perc_on_bar(ax, df["Securities_Account"])
Only 10.4% of the customers hold a securities account; The majority of te customers don't have this account.
10. Certificate of Deposit Account¶
plt.figure(figsize=(10, 6))
ax = sns.countplot(data=df, x="CD_Account")
perc_on_bar(ax, df["CD_Account"])
Only 6% of te customers hold a CD account; 94% of the customer don't have CD account.
11. Online¶
plt.figure(figsize=(10, 6))
ax = sns.countplot(data=df, x="Online")
perc_on_bar(ax, df["Online"])
59.7% of the customers utilise internet banking facilities while 40.3% of them don't utilize online banking.
12.Credit Card¶
plt.figure(figsize=(10, 6))
ax = sns.countplot(data=df, x="CreditCard")
perc_on_bar(ax, df["CreditCard"])
29.4% of customers use a credit card issues by other banks while 70.6% of the customers don't have credit cards.
13.county¶
plt.figure(figsize=(35, 20))
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd # Import pandas for data manipulation
ax = sns.countplot(df["County"])
#creating a countplot and assigning it to variable ax
val_count(ax, "County")
#Labelled on the plot
plt.xticks(rotation=90)
plt.show
matplotlib.pyplot.show
def show(*args, **kwargs)
Display all open figures. Parameters ---------- block : bool, optional Whether to wait for all figures to be closed before returning. If `True` block and run the GUI main loop until all figure windows are closed. If `False` ensure that all figure windows are displayed and return immediately. In this case, you are responsible for ensuring that the event loop is running to have responsive figures. Defaults to True in non-interactive mode and to False in interactive mode (see `.pyplot.isinteractive`). See Also -------- ion : Enable interactive mode, which shows / updates the figure after every plotting command, so that calling ``show()`` is not necessary. ioff : Disable interactive mode. savefig : Save the figure to an image file instead of showing it on screen. Notes ----- **Saving figures to file and showing a window at the same time** If you want an image file as well as a user interface window, use `.pyplot.savefig` before `.pyplot.show`. At the end of (a blocking) ``show()`` the figure is closed and thus unregistered from pyplot. Calling `.pyplot.savefig` afterwards would save a new and thus empty figure. This limitation of command order does not apply if the show is non-blocking or if you keep a reference to the figure and use `.Figure.savefig`. **Auto-show in jupyter notebooks** The jupyter backends (activated via ``%matplotlib inline``, ``%matplotlib notebook``, or ``%matplotlib widget``), call ``show()`` at the end of every cell by default. Thus, you usually don't have to call it explicitly there.
Los Angeles county has the highest number of observations of 1095. There is handful of counties with under 50 observations.
Bivariate Analysis¶
## Function to plot stacked bar chart
def stacked_plot(x):
# sns.set(palette="nipy_spectral")
tab1 = pd.crosstab(x, data["Personal_Loan"], margins=True)
print(tab1)
print("-" * 120)
tab = pd.crosstab(x, data["Personal_Loan"], normalize="index")
tab.plot(kind="bar", stacked=True, figsize=(10, 5))
plt.legend(loc="lower left", frameon=False)
plt.legend(loc="upper left", bbox_to_anchor=(0, 1))
plt.xticks(rotation=0)
plt.show()
# outlier detection using boxplot
import seaborn as sns
import matplotlib.pyplot as plt
#convert categorical data into numerical data
numerical_features = ['Age', 'Experience', 'Income', 'CCAvg', 'Mortgage']
plt.figure(figsize=(15, 10))
for i, feature in enumerate(numerical_features):
# Check if the feature exists in the DataFrame before plotting
if feature in df.columns:
plt.subplot(3, 3, i + 1)
sns.boxplot(x=df[feature])
plt.title(f'Boxplot of {feature}')
plt.xlabel(feature)
plt.tight_layout()
plt.show()
else:
print(f"Feature '{feature}' not found in the DataFrame.")
continue
- Pesonal Loan VS Age
plt.figure(figsize=(10, 6))
ax = sns.boxplot(data=df, x="Age", hue="Personal_Loan", showmeans=True)
- The IQR of the age of customers who accepted the personal loan and those rejected the personal loan are both same.
- Both have the same mean values
- No outliers spotted on both boxplots
- Personal Loan VS Experience
plt.figure(figsize=(10, 6))
ax = sns.boxplot(data=df, x="Experience", hue="Personal_Loan", showmeans=True)
plt.show()
- The mean of both boxplots are almost same
- The IQR ofthe experince of customers is almost similar for those accepted and rejected the loan
- No outliers are spotted.
- Personal Loan VS Income
plt.figure(figsize=(10, 6))
ax = sns.boxplot(data=df, x="Income", hue="Personal_Loan", showmeans=True)
plt.show()
- Customers with higher average annual income seem to have accepted the personal loan
- The minimum average incoem of customers who accepted the loan is around $60000.00
- Outliers spotted above the upper whisker for the ones with no personal loan
- Personal Loan VS Family
plt.figure(figsize=(10, 6))
ax = sns.boxplot(data=df, x="Family", hue="Personal_Loan", showmeans=True)
plt.show()
ax = sns.countplot(data=df, x="Family", hue="Personal_Loan")
perc_on_bar(ax, df["Family"])
df["Family"].value_counts()
df["Family"].unique()
array([4, 3, 1, 2])
- These two graphs indicates that family size does not make a significant impact in making the choice of a loan
- On boxplot, we can see that customers with a family of 3 out numbers the rest in accpeting the loan.
- Personal Loan VS Credit Card Average
plt.figure(figsize=(10, 6))
ax = sns.boxplot(data=df, x="CCAvg", hue="Personal_Loan", showmeans=True)
plt.show()
- It is easy to understand from the boxplot that the customers who spent more on credit card per month on an average,seem to take personal loan compares to other customers.
- Personal Loan VS Education
def stacked_plot(x):
"""
Creates a stacked bar plot showing the distribution of 'Personal_Loan'
across different categories of the given column 'x'.
Args:
x: The column to use for the x-axis of the plot.
"""
# sns.set(palette="nipy_spectral")
# Replace 'data' with 'df' to use the correct DataFrame
tab1 = pd.crosstab(x, df["Personal_Loan"], margins=True)
print(tab1)
print("-" * 120)
# ... (Rest of your stacked_plot function)
import matplotlib.pyplot as plt
import seaborn as sns
plt.figure(figsize=(10, 6))
ax = sns.boxplot(data=df, x="Education", hue="Personal_Loan", showmeans=True)
plt.show()
stacked_plot(df["Education"])
Personal_Loan 0 1 All Education 1 2003 93 2096 2 1221 182 1403 3 1296 205 1501 All 4520 480 5000 ------------------------------------------------------------------------------------------------------------------------
- Customer with graduate and advanced/profession levels of education have higher preference for personal loan than those customers who are Undergraduates.
- Personal Loan Vs Mortagage
plt.figure(figsize=(10, 6))
ax = sns.boxplot(data=df, x="Mortgage", hue="Personal_Loan", showmeans=True)
plt.show()
We can see that Mortgage is skewed towards 0 , no significant lower shikers seen on the boxplot. It is understandable that the customers with mortgage seem to have accepted more personal loan on an average.
- Personal Loan VS Securities Account
stacked_plot(df["Securities_Account"])
sns.set(palette="nipy_spectral")
ax = sns.countplot(data=df, x="Securities_Account", hue="Personal_Loan")
perc_on_bar(ax, df["Securities_Account"])
Personal_Loan 0 1 All Securities_Account 0 4058 420 4478 1 462 60 522 All 4520 480 5000 ------------------------------------------------------------------------------------------------------------------------
Around 90% of customer who have securities account are not accepting personal loan; however, only 10% of the customers are seem to accept the personal loan.
9. Personal Loan Vs CD Account¶
stacked_plot(df["CD_Account"])
ax = sns.countplot(data=df, x="CD_Account", hue="Personal_Loan")
perc_on_bar(ax, df["CD_Account"])
Personal_Loan 0 1 All CD_Account 0 4358 340 4698 1 162 140 302 All 4520 480 5000 ------------------------------------------------------------------------------------------------------------------------
- There are only 140 customers were accepted the loan out of 480 customers.
10. PERSONAL LOAN VS ONLINE¶
stacked_plot(df["Online"])
ax = sns.countplot(data=df, x="Online", hue="Personal_Loan")
perc_on_bar(ax, df["Online"])
Personal_Loan 0 1 All Online 0 1827 189 2016 1 2693 291 2984 All 4520 480 5000 ------------------------------------------------------------------------------------------------------------------------
- Found that only 291 of the customers have accepted the personal loan out of 2984 customers who avail the internet banking facilities.
11. Personal Loan VS Credit Card¶
stacked_plot(df["CreditCard"])
ax = sns.countplot(data=df, x="CreditCard", hue="Personal_Loan")
perc_on_bar(ax, df["CreditCard"])
Personal_Loan 0 1 All CreditCard 0 3193 337 3530 1 1327 143 1470 All 4520 480 5000 ------------------------------------------------------------------------------------------------------------------------
- It is showing that 1470 customers who use a credit card issued by other bank,143 of them have a pesonal loan with AllLife Bank.
12. Personal Loan VS County¶
pd.crosstab(df["County"], df["Personal_Loan"]).plot(kind="bar", stacked=True, figsize=(18, 6))
<Axes: xlabel='County'>
- It is remarkable that the Los Angeles County has the maximum number of customer who borrowed a personal loan.
- Santa Clara and San Diego counties are leading in numbers after Los Angeles county.
- Personal loan values for imperial county and Napa county need to be checked.
Multivariate Analysis¶
print(df.columns)
Index(['ID', 'Age', 'Experience', 'Income', 'ZIPCode', 'Family', 'CCAvg',
'Education', 'Mortgage', 'Personal_Loan', 'Securities_Account',
'CD_Account', 'Online', 'CreditCard', 'County'],
dtype='object')
plt.figure(figsize=(16, 7))
#convert categorical data into numerical data
numerical_features = ['Age', 'Experience', 'Income', 'CCAvg', 'Mortgage']
sns.pairplot(df[numerical_features])
plt.show()
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
# Select only numerical features for correlation analysis
numerical_df = df.select_dtypes(include=np.number)
plt.figure(figsize=(16, 7))
# Generate heatmap using the numerical features
ax = sns.heatmap(numerical_df.corr(), vmin=-1, vmax=1, annot=True, fmt=".2f", cmap="Spectral")
plt.show()
<Figure size 1600x700 with 0 Axes>
- Found that age and experience have a very high correlation of 0.99
- Income and CCAvg ahve a ositive correlation of 0.65
- Personal loan and CD Account and securities account and CD Account have an equal number of correlation of 0.32.
- Personal loan and CCAvg have a correlation of 0.37
- Personal loan and income have the correlation of 0.50
- The least correlation of 0.28 is showing between CD account and creditcard.
sns.pairplot(df, hue="Personal_Loan", corner = True)
plt.show()
- It is clear that the customers with higher income, CCAvg and Mortgage has borrowed a personal loan.
col1 = df[["Age", "Experience", "Income", "CCAvg", "Mortgage", "Personal_Loan"]].columns.to_list()
col1.remove("Personal_Loan")
print(col1)
plt.figure(figsize=(16, 7))
for i, variable in enumerate(col1):
plt.subplot(3, 3, i + 1)
sns.boxplot(x="Personal_Loan", y=variable, data=df, showfliers = False)
plt.title(f'Boxplot of {variable}')
plt.xlabel(variable)
plt.show()
['Age', 'Experience', 'Income', 'CCAvg', 'Mortgage']
- Based on these boxplots, it is prove that the observation from the pairplot seems correct.
- Customers with higher income, CCAvg and Mortgage have borrowed personal loan.
- Age and Experience show similar distribution as expected as have high positive correlation.
Data Preprocessing¶
- Missing value treatment
- Feature engineering (if needed)
- Outlier detection and treatment (if needed)
- Preparing data for modeling
- Any other preprocessing steps (if needed)
#Missing value treatment
df.isnull().sum()
| 0 | |
|---|---|
| ID | 0 |
| Age | 0 |
| Experience | 0 |
| Income | 0 |
| ZIPCode | 0 |
| Family | 0 |
| CCAvg | 0 |
| Education | 0 |
| Mortgage | 0 |
| Personal_Loan | 0 |
| Securities_Account | 0 |
| CD_Account | 0 |
| Online | 0 |
| CreditCard | 0 |
| County | 34 |
There is no missing value in this dataframe.
#checking feature engineering need in a data frame
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 5000 entries, 0 to 4999 Data columns (total 15 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 ID 5000 non-null int64 1 Age 5000 non-null int64 2 Experience 5000 non-null int64 3 Income 5000 non-null int64 4 ZIPCode 5000 non-null int64 5 Family 5000 non-null int64 6 CCAvg 5000 non-null float64 7 Education 5000 non-null int64 8 Mortgage 5000 non-null int64 9 Personal_Loan 5000 non-null int64 10 Securities_Account 5000 non-null int64 11 CD_Account 5000 non-null int64 12 Online 5000 non-null int64 13 CreditCard 5000 non-null int64 14 County 4966 non-null object dtypes: float64(1), int64(13), object(1) memory usage: 586.1+ KB
The memory usage of this data is 586.1KB. It contains 5000 rows and 15 columns.
df["County"]. value_counts()
| count | |
|---|---|
| County | |
| Los Angeles County | 1095 |
| San Diego County | 568 |
| Santa Clara County | 563 |
| Alameda County | 500 |
| Orange County | 339 |
| San Francisco County | 257 |
| San Mateo County | 204 |
| Sacramento County | 184 |
| Santa Barbara County | 154 |
| Yolo County | 130 |
| Monterey County | 128 |
| Ventura County | 114 |
| San Bernardino County | 101 |
| Contra Costa County | 85 |
| Santa Cruz County | 68 |
| Riverside County | 56 |
| Marin County | 54 |
| Kern County | 54 |
| San Luis Obispo County | 33 |
| Solano County | 33 |
| Humboldt County | 32 |
| Sonoma County | 28 |
| Fresno County | 26 |
| Placer County | 24 |
| Butte County | 19 |
| Shasta County | 18 |
| El Dorado County | 17 |
| Stanislaus County | 15 |
| San Benito County | 14 |
| San Joaquin County | 13 |
| Mendocino County | 8 |
| Siskiyou County | 7 |
| Tuolumne County | 7 |
| Merced County | 4 |
| Trinity County | 4 |
| Lake County | 4 |
| Imperial County | 3 |
| Napa County | 3 |
#Grouping counties under 40 observations into 'other county'
county = df["County"].value_counts().tail(20).keys()
df["County"] = np.where(df["County"].isin(county), "Other County", df["County"])
df["County"]. value_counts()
| count | |
|---|---|
| County | |
| Los Angeles County | 1095 |
| San Diego County | 568 |
| Santa Clara County | 563 |
| Alameda County | 500 |
| Orange County | 339 |
| Other County | 312 |
| San Francisco County | 257 |
| San Mateo County | 204 |
| Sacramento County | 184 |
| Santa Barbara County | 154 |
| Yolo County | 130 |
| Monterey County | 128 |
| Ventura County | 114 |
| San Bernardino County | 101 |
| Contra Costa County | 85 |
| Santa Cruz County | 68 |
| Riverside County | 56 |
| Marin County | 54 |
| Kern County | 54 |
df["Personal_Loan"].value_counts()
| count | |
|---|---|
| Personal_Loan | |
| 0 | 4520 |
| 1 | 480 |
df["Securities_Account"].value_counts()
| count | |
|---|---|
| Securities_Account | |
| 0 | 4478 |
| 1 | 522 |
df["CD_Account"].value_counts()
| count | |
|---|---|
| CD_Account | |
| 0 | 4698 |
| 1 | 302 |
df["Online"].value_counts()
| count | |
|---|---|
| Online | |
| 1 | 2984 |
| 0 | 2016 |
df["CreditCard"].value_counts()
| count | |
|---|---|
| CreditCard | |
| 0 | 3530 |
| 1 | 1470 |
df["Mortgage"].value_counts()
| count | |
|---|---|
| Mortgage | |
| 0 | 3462 |
| 98 | 17 |
| 119 | 16 |
| 89 | 16 |
| 91 | 16 |
| 103 | 16 |
| 83 | 16 |
| 102 | 15 |
| 90 | 15 |
| 78 | 15 |
| 87 | 14 |
| 94 | 14 |
| 131 | 14 |
| 118 | 14 |
| 104 | 14 |
| 101 | 14 |
| 144 | 13 |
| 106 | 13 |
| 81 | 13 |
| 116 | 13 |
| 109 | 13 |
| 112 | 13 |
| 97 | 12 |
| 120 | 12 |
| 100 | 12 |
| 76 | 12 |
| 185 | 12 |
| 121 | 12 |
| 84 | 11 |
| 137 | 11 |
| 79 | 11 |
| 115 | 11 |
| 153 | 11 |
| 95 | 11 |
| 86 | 11 |
| 111 | 11 |
| 158 | 11 |
| 184 | 11 |
| 113 | 10 |
| 108 | 10 |
| 142 | 10 |
| 117 | 10 |
| 82 | 10 |
| 151 | 10 |
| 161 | 10 |
| 135 | 10 |
| 148 | 9 |
| 149 | 9 |
| 166 | 9 |
| 88 | 9 |
| 221 | 9 |
| 146 | 9 |
| 123 | 9 |
| 167 | 9 |
| 128 | 9 |
| 159 | 9 |
| 138 | 8 |
| 205 | 8 |
| 147 | 8 |
| 174 | 8 |
| 218 | 8 |
| 122 | 8 |
| 75 | 8 |
| 140 | 8 |
| 170 | 8 |
| 194 | 8 |
| 110 | 8 |
| 114 | 8 |
| 169 | 8 |
| 157 | 8 |
| 164 | 8 |
| 105 | 8 |
| 207 | 8 |
| 132 | 8 |
| 96 | 7 |
| 127 | 7 |
| 124 | 7 |
| 229 | 7 |
| 204 | 7 |
| 230 | 7 |
| 85 | 7 |
| 249 | 7 |
| 219 | 7 |
| 80 | 7 |
| 126 | 7 |
| 196 | 7 |
| 129 | 7 |
| 154 | 7 |
| 193 | 7 |
| 130 | 6 |
| 187 | 6 |
| 145 | 6 |
| 236 | 6 |
| 192 | 6 |
| 134 | 6 |
| 136 | 6 |
| 150 | 6 |
| 240 | 6 |
| 251 | 6 |
| 155 | 6 |
| 163 | 6 |
| 182 | 6 |
| 99 | 6 |
| 141 | 6 |
| 107 | 6 |
| 125 | 6 |
| 198 | 5 |
| 272 | 5 |
| 227 | 5 |
| 188 | 5 |
| 307 | 5 |
| 180 | 5 |
| 301 | 5 |
| 220 | 5 |
| 171 | 5 |
| 172 | 5 |
| 212 | 5 |
| 209 | 5 |
| 256 | 5 |
| 178 | 5 |
| 217 | 5 |
| 294 | 5 |
| 156 | 5 |
| 92 | 4 |
| 245 | 4 |
| 162 | 4 |
| 342 | 4 |
| 232 | 4 |
| 93 | 4 |
| 189 | 4 |
| 231 | 4 |
| 297 | 4 |
| 200 | 4 |
| 199 | 4 |
| 241 | 4 |
| 181 | 4 |
| 247 | 4 |
| 211 | 4 |
| 77 | 4 |
| 224 | 4 |
| 175 | 4 |
| 179 | 4 |
| 203 | 4 |
| 152 | 4 |
| 190 | 4 |
| 139 | 4 |
| 244 | 4 |
| 325 | 3 |
| 366 | 3 |
| 226 | 3 |
| 315 | 3 |
| 329 | 3 |
| 133 | 3 |
| 282 | 3 |
| 422 | 3 |
| 215 | 3 |
| 223 | 3 |
| 255 | 3 |
| 197 | 3 |
| 228 | 3 |
| 270 | 3 |
| 323 | 3 |
| 143 | 3 |
| 242 | 3 |
| 268 | 3 |
| 277 | 3 |
| 168 | 3 |
| 186 | 3 |
| 257 | 3 |
| 352 | 3 |
| 248 | 3 |
| 264 | 3 |
| 380 | 3 |
| 319 | 3 |
| 341 | 3 |
| 428 | 3 |
| 238 | 3 |
| 275 | 3 |
| 176 | 3 |
| 263 | 3 |
| 260 | 3 |
| 333 | 3 |
| 177 | 3 |
| 305 | 3 |
| 208 | 3 |
| 214 | 3 |
| 285 | 3 |
| 308 | 3 |
| 310 | 3 |
| 328 | 3 |
| 213 | 3 |
| 327 | 3 |
| 239 | 3 |
| 216 | 3 |
| 160 | 3 |
| 165 | 3 |
| 400 | 3 |
| 243 | 2 |
| 276 | 2 |
| 330 | 2 |
| 266 | 2 |
| 306 | 2 |
| 359 | 2 |
| 565 | 2 |
| 299 | 2 |
| 449 | 2 |
| 313 | 2 |
| 225 | 2 |
| 309 | 2 |
| 267 | 2 |
| 246 | 2 |
| 202 | 2 |
| 427 | 2 |
| 292 | 2 |
| 293 | 2 |
| 368 | 2 |
| 358 | 2 |
| 312 | 2 |
| 262 | 2 |
| 408 | 2 |
| 183 | 2 |
| 364 | 2 |
| 455 | 2 |
| 394 | 2 |
| 322 | 2 |
| 252 | 2 |
| 287 | 2 |
| 442 | 2 |
| 233 | 2 |
| 437 | 2 |
| 392 | 2 |
| 372 | 2 |
| 314 | 2 |
| 234 | 2 |
| 289 | 2 |
| 402 | 2 |
| 250 | 2 |
| 259 | 2 |
| 304 | 2 |
| 303 | 2 |
| 222 | 2 |
| 280 | 2 |
| 271 | 2 |
| 201 | 2 |
| 354 | 2 |
| 357 | 2 |
| 397 | 2 |
| 296 | 1 |
| 416 | 1 |
| 344 | 1 |
| 485 | 1 |
| 383 | 1 |
| 373 | 1 |
| 403 | 1 |
| 553 | 1 |
| 210 | 1 |
| 321 | 1 |
| 508 | 1 |
| 286 | 1 |
| 500 | 1 |
| 429 | 1 |
| 265 | 1 |
| 206 | 1 |
| 378 | 1 |
| 431 | 1 |
| 195 | 1 |
| 410 | 1 |
| 273 | 1 |
| 590 | 1 |
| 612 | 1 |
| 283 | 1 |
| 550 | 1 |
| 581 | 1 |
| 571 | 1 |
| 405 | 1 |
| 302 | 1 |
| 577 | 1 |
| 281 | 1 |
| 481 | 1 |
| 173 | 1 |
| 355 | 1 |
| 509 | 1 |
| 464 | 1 |
| 278 | 1 |
| 587 | 1 |
| 353 | 1 |
| 351 | 1 |
| 258 | 1 |
| 253 | 1 |
| 345 | 1 |
| 300 | 1 |
| 381 | 1 |
| 415 | 1 |
| 461 | 1 |
| 467 | 1 |
| 496 | 1 |
| 522 | 1 |
| 475 | 1 |
| 343 | 1 |
| 398 | 1 |
| 466 | 1 |
| 535 | 1 |
| 419 | 1 |
| 589 | 1 |
| 360 | 1 |
| 389 | 1 |
| 334 | 1 |
| 336 | 1 |
| 284 | 1 |
| 337 | 1 |
| 567 | 1 |
| 601 | 1 |
| 382 | 1 |
| 617 | 1 |
| 433 | 1 |
| 421 | 1 |
| 483 | 1 |
| 391 | 1 |
| 237 | 1 |
| 318 | 1 |
| 311 | 1 |
| 446 | 1 |
| 290 | 1 |
| 406 | 1 |
| 385 | 1 |
| 635 | 1 |
| 235 | 1 |
| 295 | 1 |
| 331 | 1 |
| 582 | 1 |
| 396 | 1 |
| 477 | 1 |
| 432 | 1 |
| 452 | 1 |
| 524 | 1 |
| 412 | 1 |
| 191 | 1 |
| 470 | 1 |
| 374 | 1 |
| 569 | 1 |
| 298 | 1 |
| 326 | 1 |
| 547 | 1 |
| 458 | 1 |
| 505 | 1 |
| 361 | 1 |
| 541 | 1 |
df.groupby("Personal_Loan")["Mortgage"].value_counts()
| count | ||
|---|---|---|
| Personal_Loan | Mortgage | |
| 0 | 0 | 3150 |
| 98 | 17 | |
| 83 | 16 | |
| 103 | 16 | |
| 89 | 16 | |
| 102 | 15 | |
| 91 | 15 | |
| 90 | 15 | |
| 119 | 15 | |
| 78 | 15 | |
| 131 | 14 | |
| 87 | 14 | |
| 104 | 14 | |
| 101 | 14 | |
| 94 | 13 | |
| 116 | 13 | |
| 112 | 13 | |
| 118 | 13 | |
| 109 | 13 | |
| 144 | 12 | |
| 100 | 12 | |
| 81 | 12 | |
| 185 | 12 | |
| 95 | 11 | |
| 121 | 11 | |
| 79 | 11 | |
| 76 | 11 | |
| 137 | 11 | |
| 153 | 11 | |
| 106 | 11 | |
| 97 | 11 | |
| 120 | 10 | |
| 117 | 10 | |
| 135 | 10 | |
| 113 | 10 | |
| 84 | 10 | |
| 108 | 10 | |
| 151 | 10 | |
| 158 | 10 | |
| 86 | 9 | |
| 128 | 9 | |
| 82 | 9 | |
| 184 | 9 | |
| 142 | 9 | |
| 161 | 9 | |
| 111 | 9 | |
| 159 | 9 | |
| 166 | 9 | |
| 115 | 9 | |
| 148 | 9 | |
| 146 | 8 | |
| 157 | 8 | |
| 174 | 8 | |
| 167 | 8 | |
| 218 | 8 | |
| 207 | 8 | |
| 169 | 8 | |
| 164 | 8 | |
| 88 | 8 | |
| 105 | 8 | |
| 123 | 8 | |
| 122 | 8 | |
| 110 | 8 | |
| 114 | 8 | |
| 132 | 7 | |
| 221 | 7 | |
| 80 | 7 | |
| 75 | 7 | |
| 96 | 7 | |
| 196 | 7 | |
| 170 | 7 | |
| 230 | 7 | |
| 154 | 7 | |
| 205 | 7 | |
| 149 | 7 | |
| 147 | 7 | |
| 219 | 7 | |
| 85 | 7 | |
| 194 | 7 | |
| 140 | 7 | |
| 126 | 7 | |
| 138 | 7 | |
| 127 | 7 | |
| 129 | 7 | |
| 193 | 7 | |
| 124 | 7 | |
| 187 | 6 | |
| 182 | 6 | |
| 204 | 6 | |
| 236 | 6 | |
| 136 | 6 | |
| 145 | 6 | |
| 107 | 6 | |
| 125 | 6 | |
| 130 | 6 | |
| 155 | 6 | |
| 141 | 6 | |
| 249 | 6 | |
| 220 | 5 | |
| 192 | 5 | |
| 256 | 5 | |
| 217 | 5 | |
| 188 | 5 | |
| 134 | 5 | |
| 163 | 5 | |
| 240 | 5 | |
| 180 | 5 | |
| 172 | 5 | |
| 178 | 5 | |
| 156 | 5 | |
| 229 | 5 | |
| 150 | 5 | |
| 241 | 4 | |
| 272 | 4 | |
| 245 | 4 | |
| 232 | 4 | |
| 212 | 4 | |
| 247 | 4 | |
| 227 | 4 | |
| 224 | 4 | |
| 199 | 4 | |
| 251 | 4 | |
| 200 | 4 | |
| 244 | 4 | |
| 99 | 4 | |
| 171 | 4 | |
| 181 | 4 | |
| 189 | 4 | |
| 139 | 4 | |
| 190 | 4 | |
| 152 | 4 | |
| 93 | 4 | |
| 92 | 4 | |
| 162 | 4 | |
| 77 | 4 | |
| 179 | 4 | |
| 160 | 3 | |
| 248 | 3 | |
| 268 | 3 | |
| 242 | 3 | |
| 143 | 3 | |
| 133 | 3 | |
| 257 | 3 | |
| 263 | 3 | |
| 264 | 3 | |
| 198 | 3 | |
| 270 | 3 | |
| 239 | 3 | |
| 275 | 3 | |
| 285 | 3 | |
| 294 | 3 | |
| 297 | 3 | |
| 307 | 3 | |
| 328 | 3 | |
| 333 | 3 | |
| 341 | 3 | |
| 366 | 3 | |
| 165 | 3 | |
| 323 | 3 | |
| 238 | 3 | |
| 175 | 3 | |
| 208 | 3 | |
| 223 | 3 | |
| 177 | 3 | |
| 168 | 3 | |
| 226 | 3 | |
| 209 | 3 | |
| 228 | 3 | |
| 211 | 3 | |
| 186 | 3 | |
| 203 | 3 | |
| 214 | 3 | |
| 215 | 3 | |
| 231 | 3 | |
| 310 | 2 | |
| 213 | 2 | |
| 306 | 2 | |
| 325 | 2 | |
| 201 | 2 | |
| 322 | 2 | |
| 319 | 2 | |
| 315 | 2 | |
| 308 | 2 | |
| 314 | 2 | |
| 313 | 2 | |
| 329 | 2 | |
| 330 | 2 | |
| 408 | 2 | |
| 402 | 2 | |
| 455 | 2 | |
| 397 | 2 | |
| 394 | 2 | |
| 392 | 2 | |
| 380 | 2 | |
| 352 | 2 | |
| 202 | 2 | |
| 303 | 2 | |
| 305 | 2 | |
| 309 | 2 | |
| 262 | 2 | |
| 260 | 2 | |
| 271 | 2 | |
| 222 | 2 | |
| 267 | 2 | |
| 225 | 2 | |
| 176 | 2 | |
| 277 | 2 | |
| 255 | 2 | |
| 252 | 2 | |
| 250 | 2 | |
| 233 | 2 | |
| 243 | 2 | |
| 276 | 2 | |
| 197 | 2 | |
| 216 | 2 | |
| 293 | 2 | |
| 289 | 2 | |
| 287 | 2 | |
| 280 | 2 | |
| 311 | 1 | |
| 355 | 1 | |
| 326 | 1 | |
| 327 | 1 | |
| 331 | 1 | |
| 334 | 1 | |
| 336 | 1 | |
| 337 | 1 | |
| 342 | 1 | |
| 344 | 1 | |
| 345 | 1 | |
| 353 | 1 | |
| 354 | 1 | |
| 357 | 1 | |
| 398 | 1 | |
| 358 | 1 | |
| 359 | 1 | |
| 360 | 1 | |
| 361 | 1 | |
| 364 | 1 | |
| 368 | 1 | |
| 381 | 1 | |
| 382 | 1 | |
| 383 | 1 | |
| 385 | 1 | |
| 389 | 1 | |
| 321 | 1 | |
| 312 | 1 | |
| 304 | 1 | |
| 302 | 1 | |
| 173 | 1 | |
| 183 | 1 | |
| 191 | 1 | |
| 195 | 1 | |
| 206 | 1 | |
| 234 | 1 | |
| 235 | 1 | |
| 246 | 1 | |
| 258 | 1 | |
| 259 | 1 | |
| 265 | 1 | |
| 266 | 1 | |
| 273 | 1 | |
| 278 | 1 | |
| 283 | 1 | |
| 284 | 1 | |
| 286 | 1 | |
| 290 | 1 | |
| 292 | 1 | |
| 296 | 1 | |
| 298 | 1 | |
| 299 | 1 | |
| 300 | 1 | |
| 396 | 1 | |
| 400 | 1 | |
| 565 | 1 | |
| 601 | 1 | |
| 635 | 1 | |
| 403 | 1 | |
| 553 | 1 | |
| 550 | 1 | |
| 524 | 1 | |
| 509 | 1 | |
| 405 | 1 | |
| 406 | 1 | |
| 415 | 1 | |
| 419 | 1 | |
| 422 | 1 | |
| 427 | 1 | |
| 428 | 1 | |
| 429 | 1 | |
| 433 | 1 | |
| 437 | 1 | |
| 442 | 1 | |
| 449 | 1 | |
| 452 | 1 | |
| 458 | 1 | |
| 461 | 1 | |
| 464 | 1 | |
| 470 | 1 | |
| 475 | 1 | |
| 481 | 1 | |
| 496 | 1 | |
| 500 | 1 | |
| 505 | 1 | |
| 508 | 1 | |
| 1 | 0 | 312 |
| 301 | 5 | |
| 342 | 3 | |
| 282 | 3 | |
| 422 | 2 | |
| 428 | 2 | |
| 86 | 2 | |
| 221 | 2 | |
| 307 | 2 | |
| 294 | 2 | |
| 327 | 2 | |
| 251 | 2 | |
| 372 | 2 | |
| 229 | 2 | |
| 400 | 2 | |
| 184 | 2 | |
| 209 | 2 | |
| 149 | 2 | |
| 198 | 2 | |
| 106 | 2 | |
| 115 | 2 | |
| 99 | 2 | |
| 111 | 2 | |
| 120 | 2 | |
| 295 | 1 | |
| 167 | 1 | |
| 266 | 1 | |
| 272 | 1 | |
| 277 | 1 | |
| 281 | 1 | |
| 304 | 1 | |
| 175 | 1 | |
| 205 | 1 | |
| 299 | 1 | |
| 171 | 1 | |
| 297 | 1 | |
| 211 | 1 | |
| 170 | 1 | |
| 292 | 1 | |
| 260 | 1 | |
| 253 | 1 | |
| 259 | 1 | |
| 255 | 1 | |
| 197 | 1 | |
| 194 | 1 | |
| 213 | 1 | |
| 203 | 1 | |
| 216 | 1 | |
| 227 | 1 | |
| 192 | 1 | |
| 204 | 1 | |
| 231 | 1 | |
| 234 | 1 | |
| 237 | 1 | |
| 183 | 1 | |
| 240 | 1 | |
| 246 | 1 | |
| 249 | 1 | |
| 176 | 1 | |
| 212 | 1 | |
| 210 | 1 | |
| 374 | 1 | |
| 305 | 1 | |
| 522 | 1 | |
| 431 | 1 | |
| 432 | 1 | |
| 437 | 1 | |
| 442 | 1 | |
| 446 | 1 | |
| 449 | 1 | |
| 466 | 1 | |
| 467 | 1 | |
| 477 | 1 | |
| 483 | 1 | |
| 485 | 1 | |
| 535 | 1 | |
| 421 | 1 | |
| 541 | 1 | |
| 590 | 1 | |
| 547 | 1 | |
| 565 | 1 | |
| 567 | 1 | |
| 569 | 1 | |
| 612 | 1 | |
| 571 | 1 | |
| 161 | 1 | |
| 581 | 1 | |
| 577 | 1 | |
| 427 | 1 | |
| 416 | 1 | |
| 308 | 1 | |
| 354 | 1 | |
| 310 | 1 | |
| 312 | 1 | |
| 315 | 1 | |
| 318 | 1 | |
| 319 | 1 | |
| 325 | 1 | |
| 329 | 1 | |
| 587 | 1 | |
| 343 | 1 | |
| 351 | 1 | |
| 352 | 1 | |
| 582 | 1 | |
| 412 | 1 | |
| 357 | 1 | |
| 358 | 1 | |
| 359 | 1 | |
| 364 | 1 | |
| 368 | 1 | |
| 373 | 1 | |
| 378 | 1 | |
| 380 | 1 | |
| 391 | 1 | |
| 589 | 1 | |
| 410 | 1 | |
| 163 | 1 | |
| 158 | 1 | |
| 150 | 1 | |
| 97 | 1 | |
| 75 | 1 | |
| 76 | 1 | |
| 81 | 1 | |
| 82 | 1 | |
| 84 | 1 | |
| 88 | 1 | |
| 91 | 1 | |
| 94 | 1 | |
| 118 | 1 | |
| 119 | 1 | |
| 121 | 1 | |
| 123 | 1 | |
| 132 | 1 | |
| 134 | 1 | |
| 138 | 1 | |
| 140 | 1 | |
| 142 | 1 | |
| 144 | 1 | |
| 146 | 1 | |
| 147 | 1 | |
| 617 | 1 |
- Mortgage can be restructed as Yes/No for better analysis.
- It is also evident that personal loan is '0' for all mortgage '0' values.
- We have seen that mortgage is highly skewed towards '0'
#user defined function to convert mortgage to a boolean data
def mort(val):
if val == 0:
return "No"
else:
return "Yes"
Mortgage = df["Mortgage"].apply(mort)
Mortgage.value_counts()
df["Mortgage"] = df["Mortgage"].apply(mort)
df["Mortgage"].value_counts()
| count | |
|---|---|
| Mortgage | |
| No | 3462 |
| Yes | 1538 |
#Analyse the customer have mortgage to accept personal loan by yes or no
df.groupby("Mortgage")["Personal_Loan"].value_counts()
| count | ||
|---|---|---|
| Mortgage | Personal_Loan | |
| No | 0 | 3150 |
| 1 | 312 | |
| Yes | 0 | 1370 |
| 1 | 168 |
There are 480 customers have chance to take personal loan out of 5000 customers with the bank.
df_dtree = df.copy()
df_dtree.head()
| ID | Age | Experience | Income | ZIPCode | Family | CCAvg | Education | Mortgage | Personal_Loan | Securities_Account | CD_Account | Online | CreditCard | County | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 25 | 1 | 49 | 91107 | 4 | 1.6 | 1 | No | 0 | 1 | 0 | 0 | 0 | Los Angeles County |
| 1 | 2 | 45 | 19 | 34 | 90089 | 3 | 1.5 | 1 | No | 0 | 1 | 0 | 0 | 0 | Los Angeles County |
| 2 | 3 | 39 | 15 | 11 | 94720 | 1 | 1.0 | 1 | No | 0 | 0 | 0 | 0 | 0 | Alameda County |
| 3 | 4 | 35 | 9 | 100 | 94112 | 1 | 2.7 | 2 | No | 0 | 0 | 0 | 0 | 0 | San Francisco County |
| 4 | 5 | 35 | 8 | 45 | 91330 | 4 | 1.0 | 2 | No | 0 | 0 | 0 | 0 | 1 | Los Angeles County |
Outlier treatment¶
#creating a list of numerical columns
numerical_features = ['Age', 'Experience', 'Income', 'CCAvg', 'Mortgage']
#Boxplots of numerical columns to view the outliers
plt.figure(figsize=(15, 10))
for i, feature in enumerate(numerical_features):
plt.subplot(3, 3, i + 1)
sns.boxplot(x=df[feature])
plt.title(f'Boxplot of {feature}')
plt.xlabel(feature)
plt.tight_layout()
plt.show()
- The outliers are present in income,CCAvg and mortgage.
#Creating a list of numerical columns
numerical_features = ['Age', 'Experience', 'Income', 'CCAvg', 'Mortgage']
# User defined function to floor and cap the outliers
def treat_outliers(df, col):
"""
treats outliers in a variable
col: str, name of the numerical variable
df: dataframe
col: name of the column
"""
Q1 = df[col].quantile(0.25) # 1st quantile
Q3 = df[col].quantile(0.75) # 3rd quantile
IQR = Q3 - Q1
Lower_whisker = Q1 - 1.5 * IQR
Upper_whisker = Q3 + 1.5 * IQR
# all the values smaller than Lower_Whisker will be assigned the value of Lower_Whisker
# all the values greater than Upper_Whisker will be assigned the value of Upper_Whisker
df[col] = np.clip(df[col], Lower_whisker, Upper_whisker)
return df
def treat_outliers_all(df, col_list):
"""
treat outlier in all numerical variables
col_list: list of numerical variables
df: data frame
"""
for c in col_list:
df = treat_outliers(df, c)
return df
# Assuming 'Mortgage_Yes/No' is the correct column name
# Convert 'Yes'/'No' to 1/0 and create a new 'Mortgage' column if it doesn't exist.
# If 'Mortgage' already exists (e.g., as a numeric type), this line won't change it.
df['Mortgage'] = df['Mortgage'].map({'Yes': 1, 'No': 0}).astype(float)
#Creating a list of numerical columns
numerical_features = ['Age', 'Experience', 'Income', 'CCAvg']
df['Mortgage'] = df['Mortgage'].map({'Yes': 1, 'No': 0}).astype(float)
#Treating the outliers for numerical features ONLY
df = treat_outliers_all(df, numerical_features)
#Plotting the boxplot to check if the outliers are fixed
plt.figure(figsize=(15, 10))
for i, variable in enumerate(col1):
plt.subplot(2, 3, i + 1)
plt.boxplot(df[variable], whis=1.5)
plt.tight_layout()
plt.title(variable)
plt.show()
There is no outliers are present; it has been fixed now.
#Categorizing datatypes
df["County"] = df["County"].astype("category")
df["Mortgage"] = df["Mortgage"].astype("category")
df["Education"] = df["Education"].astype("category")
df.info()
- The data processing has been done.
- Next procedure is to drop the unwanted columns before proceeding with model building
#dropping unwanted columns
df.drop([ 'Mortgage'], axis = 1, inplace=True)
# Corrected the column name to 'Mortgage_Yes/No'
#Final dataset
df.head()
| ID | Age | Experience | Income | ZIPCode | Family | CCAvg | Education | Personal_Loan | Securities_Account | CD_Account | Online | CreditCard | County | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 25 | 1 | 49.0 | 91107 | 4 | 1.6 | 1 | 0 | 1 | 0 | 0 | 0 | Los Angeles County |
| 1 | 2 | 45 | 19 | 34.0 | 90089 | 3 | 1.5 | 1 | 0 | 1 | 0 | 0 | 0 | Los Angeles County |
| 2 | 3 | 39 | 15 | 11.0 | 94720 | 1 | 1.0 | 1 | 0 | 0 | 0 | 0 | 0 | Alameda County |
| 3 | 4 | 35 | 9 | 100.0 | 94112 | 1 | 2.7 | 2 | 0 | 0 | 0 | 0 | 0 | San Francisco County |
| 4 | 5 | 35 | 8 | 45.0 | 91330 | 4 | 1.0 | 2 | 0 | 0 | 0 | 0 | 1 | Los Angeles County |
#Making a copy of the data for logistic regression model building
df_log = df.copy()
Key insights from EDA
The results shows from the EDA are maximum number of customers who opted for a personal loan are the following: *They are between the ages of 35 and 55.
- Have a minimum experience of 9 years.
- Have a minimum annual average income above 120000 dollars.
- Have a fsmily size of 3+.
- They own a CD account with the bank
- Have an education level of two and above.
- Spend an average of $25K on creditcards every month
- They do have a mortgage.
- Avail online banking facilities
- These customers are from the Los Angeles,San Diego or Santa clara counties.
Model Building¶
Model Building - Logistic Regression
# defining the independent variable
X = df.drop(columns=['ID', 'Personal_Loan'])
df_log.head()
| ID | Age | Experience | Income | ZIPCode | Family | CCAvg | Education | Personal_Loan | Securities_Account | CD_Account | Online | CreditCard | County | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 25 | 1 | 49.0 | 91107 | 4 | 1.6 | 1 | 0 | 1 | 0 | 0 | 0 | Los Angeles County |
| 1 | 2 | 45 | 19 | 34.0 | 90089 | 3 | 1.5 | 1 | 0 | 1 | 0 | 0 | 0 | Los Angeles County |
| 2 | 3 | 39 | 15 | 11.0 | 94720 | 1 | 1.0 | 1 | 0 | 0 | 0 | 0 | 0 | Alameda County |
| 3 | 4 | 35 | 9 | 100.0 | 94112 | 1 | 2.7 | 2 | 0 | 0 | 0 | 0 | 0 | San Francisco County |
| 4 | 5 | 35 | 8 | 45.0 | 91330 | 4 | 1.0 | 2 | 0 | 0 | 0 | 0 | 1 | Los Angeles County |
This data shows that the ages between 25-45 are more likely to take personal loan.
def perc_on_bar(ax, df):
"""
drop the unwanted columns before
proceed with model building
"""
for p in ax.patches:
width = p.get_width()
height = p.get_height()
x, y = p.get_xy()
X = df.drop(columns=['ID', 'Personal_Loan'])
y = df['Personal_Loan']
# fitting the model to the training data
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
import pandas as pd # You likely already have this imported, but it's best to be explicit
#Creating dummy variable for categorical features
X = pd.get_dummies(X, drop_first=True)
#convert categorical data into numerical data
X=pd.get_dummies(X, columns=X.filter(items=["Age", "Experience", "Income", "CC_Avg", "Zipcode", "Family", "Mortgage", "Education"]).columns.tolist(), drop_first=True)
X.head()
#splitting data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)
print("\n Percentage of clases in training set :")
print(y_train.value_counts(normalize=True))
print("\n Percentage of clases in test set :")
print(y_test.value_counts(normalize=True))
(3500, 260) (1500, 260) (3500,) (1500,) Percentage of clases in training set : Personal_Loan 0 0.907714 1 0.092286 Name: proportion, dtype: float64 Percentage of clases in test set : Personal_Loan 0 0.895333 1 0.104667 Name: proportion, dtype: float64
We have a well balanced classes of the dependant variable in both train and test sets.
#Creating training and test sets
logistic_regression_model = LogisticRegression()
logistic_regression_model.fit(X_train, y_train)
df_log.head()
X = df.drop(columns=['ID', 'Personal_Loan'])
y = df['Personal_Loan']
#creating dummy variables for categorical features
X = pd.get_dummies(X, drop_first=True)
#splitting data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
Model Evaluation Criterion¶
- *** Model can make wrong prediction such as***: 1. Predicting a customer will not borrow the personal loan but in reality the customer will borrow the loan. 2. Predicting a customer will borrow the personal loan but in reality the customer will not borrow the loan. The concerning prediction is the first prediction. AllLife Bank is looking to expand its loan business, every single customer who will borrow the personal loan is very important. Therefore, the mistakes in the first prediction has to be considerably low.
How to resolve false Negatives? We can maximized the Recall score inorder to resolve false negatives. greater the Recall score higher the chances of predicting the potential customers who will borrow the personal loan.
Model Building¶
# defining a function to compute different metrics to check performance of a classification model built using sklearn
def model_performance_classification(model, predictors, target):
"""
Function to compute different metrics to check classification model performance
model: classifier
predictors: independent variables
target: dependent variable
"""
# predicting using the independent variables
pred = model.predict(predictors)
acc = accuracy_score(target, pred) # to compute Accuracy
recall = recall_score(target, pred) # to compute Recall
precision = precision_score(target, pred) # to compute Precision
f1 = f1_score(target, pred) # to compute F1-score
# creating a dataframe of metrics
df_perf = pd.DataFrame(
{"Accuracy": acc, "Recall": recall, "Precision": precision, "F1": f1,},
index=[0],
)
return df_perf
def confusion_matrix_sklearn(model, predictors, target):
"""
To plot the confusion_matrix with percentages
model: classifier
predictors: independent variables
target: dependent variable
"""
y_pred = model.predict(predictors)
cm = confusion_matrix(target, y_pred)
labels = np.asarray(
[
["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
for item in cm.flatten()
]
).reshape((2, 2))
plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=labels, fmt="")
plt.ylabel("True label")
plt.xlabel("Predicted label")
Logistic Regression with sklearn library¶
#Using the newtown-cg solver as it is faster for high-dimential data
import warnings
warnings.filterwarnings('ignore')
lg = LogisticRegression(solver='newton-cg')
lg.fit(X_train, y_train)
LogisticRegression(solver='newton-cg')In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
LogisticRegression(solver='newton-cg')
Model performance on training set
#Predicting on training set
y_pred_train = lg.predict(X_train)
print("Training set performance:\n")
print(model_performance_classification(lg, X_train, y_train))
print(confusion_matrix_sklearn(lg, X_train, y_train))
print("Recall: ", recall_score(y_train, y_pred_train))
print("Accuracy: ", accuracy_score(y_train, y_pred_train))
print("Precision: ", precision_score(y_train, y_pred_train))
print("F1: ", f1_score(y_train, y_pred_train))
Training set performance: Accuracy Recall Precision F1 0 0.962 0.671827 0.889344 0.765432 None Recall: 0.6718266253869969 Accuracy: 0.962 Precision: 0.889344262295082 F1: 0.7654320987654321
print("Training set performance:\n")
print(model_performance_classification(lg, X_train, y_train))
print(confusion_matrix_sklearn(lg, X_train, y_train))
print("Recall: ", recall_score(y_train, y_pred_train))
print("Accuracy: ", accuracy_score(y_train, y_pred_train))
print("Precision: ", precision_score(y_train, y_pred_train))
print("F1: ", f1_score(y_train, y_pred_train))
Training set performance: Accuracy Recall Precision F1 0 0.962 0.671827 0.889344 0.765432 None Recall: 0.6718266253869969 Accuracy: 0.962 Precision: 0.889344262295082 F1: 0.7654320987654321
print("Training set performance:\n")
print(model_performance_classification(lg, X_train, y_train))
print(confusion_matrix_sklearn(lg, X_train, y_train))
print("Recall: ", recall_score(y_train, y_pred_train))
print("Accuracy: ", accuracy_score(y_train, y_pred_train))
print("Precision: ", precision_score(y_train, y_pred_train))
print("F1: ", f1_score(y_train, y_pred_train))
Training set performance: Accuracy Recall Precision F1 0 0.962 0.671827 0.889344 0.765432 None Recall: 0.6718266253869969 Accuracy: 0.962 Precision: 0.889344262295082 F1: 0.7654320987654321
print("Training set performance:\n")
print(model_performance_classification(lg, X_train, y_train))
print(confusion_matrix_sklearn(lg, X_train, y_train))
print("Recall: ", recall_score(y_train, y_pred_train))
print("Accuracy: ", accuracy_score(y_train, y_pred_train))
print("Precision: ", precision_score(y_train, y_pred_train))
print("F1: ", f1_score(y_train, y_pred_train))
Training set performance: Accuracy Recall Precision F1 0 0.962 0.671827 0.889344 0.765432 None Recall: 0.6718266253869969 Accuracy: 0.962 Precision: 0.889344262295082 F1: 0.7654320987654321
Model performance on test set
#Predicting on the test set
y_pred_test = lg.predict(X_test)
print("Test set performance:\n")
print(model_performance_classification(lg, X_test, y_test))
print(confusion_matrix_sklearn(lg, X_test, y_test))
print("Recall: ", recall_score(y_test, y_pred_test))
print("Accuracy: ", accuracy_score(y_test, y_pred_test))
print("Precision: ", precision_score(y_test, y_pred_test))
print("F1: ", f1_score(y_test, y_pred_test))
Test set performance: Accuracy Recall Precision F1 0 0.962 0.713376 0.903226 0.797153 None Recall: 0.7133757961783439 Accuracy: 0.962 Precision: 0.9032258064516129 F1: 0.797153024911032
Observations
- The training and testing recall scores are 0.616 and 0.62; same scores.
- The metric scores are looking good
- however,for checking the statstical validity of the model, we must buid a model using the statsmodels library
- Accuracy in the test model is only 1% lower than the training model.
Logistic Regression with statsmodels library
# Importing necessary libraries
import statsmodels.api as sm
import pandas as pd
from sklearn.model_selection import train_test_split
# Assuming 'df' is your original DataFrame containing the data
# If it's named something else, replace 'df' with the actual name
df_logit = df.copy() # Creating a copy of the original DataFrame and assigning it to df_logit
logistic_regression = sm.Logit(y_train, X_train.astype(float))
X = df_logit.drop("Personal_Loan", axis=1)
y = df_logit["Personal_Loan"]
X = pd.get_dummies(X, drop_first=True)
# adding constant
X = sm.add_constant(X)
# Splitting data in train and test sets
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.20, random_state=42
)
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
import pandas as pd
from sklearn.model_selection import train_test_split
def logistic_regression(X, y):
# Ensure all columns in X are numeric
X = X.select_dtypes(include=['number']) # Select only numeric columns
# sm.Logit expects y to be a 1D array, not a Series
y = y.values
logistic_regression_model = sm.Logit(y, X)
result = logistic_regression_model.fit()
return result
# Assuming X and y are your original DataFrames
# Convert categorical columns in X to numeric using one-hot encoding
X = pd.get_dummies(X, drop_first=True)
# Add a constant to the independent variables
X = sm.add_constant(X)
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)
# Now call logistic regression function
result = logistic_regression(X_train, y_train)
print(result.summary())
Optimization terminated successfully.
Current function value: 0.153085
Iterations 9
Logit Regression Results
==============================================================================
Dep. Variable: y No. Observations: 4000
Model: Logit Df Residuals: 3988
Method: MLE Df Model: 11
Date: Mon, 30 Sep 2024 Pseudo R-squ.: 0.5080
Time: 04:16:37 Log-Likelihood: -612.34
converged: True LL-Null: -1244.5
Covariance Type: nonrobust LLR p-value: 2.170e-264
======================================================================================
coef std err z P>|z| [0.025 0.975]
--------------------------------------------------------------------------------------
const -9.9667 4.192 -2.377 0.017 -18.183 -1.750
ID -3.403e-05 5.35e-05 -0.636 0.525 -0.000 7.08e-05
Age 0.2023 0.060 3.354 0.001 0.084 0.320
Experience -0.1917 0.060 -3.181 0.001 -0.310 -0.074
Income 0.0430 0.002 18.553 0.000 0.038 0.048
ZIPCode -4.953e-05 4.18e-05 -1.186 0.236 -0.000 3.23e-05
Family 0.9011 0.081 11.082 0.000 0.742 1.060
CCAvg 0.2176 0.051 4.260 0.000 0.117 0.318
Securities_Account -0.8885 0.289 -3.072 0.002 -1.455 -0.322
CD_Account 3.6631 0.320 11.449 0.000 3.036 4.290
Online -0.5494 0.161 -3.423 0.001 -0.864 -0.235
CreditCard -1.2038 0.214 -5.626 0.000 -1.623 -0.784
======================================================================================
#Fitting logistic regression model
logistic_regression = sm.Logit(y_train, X_train.astype(float))
logit = sm.Logit(y_train, X_train.astype(float))
result = logit.fit()
lg = logistic_regression.fit(disp=False)
print(result.summary())
Optimization terminated successfully.
Current function value: 0.111998
Iterations 9
Logit Regression Results
==============================================================================
Dep. Variable: Personal_Loan No. Observations: 4000
Model: Logit Df Residuals: 3968
Method: MLE Df Model: 31
Date: Mon, 30 Sep 2024 Pseudo R-squ.: 0.6400
Time: 04:16:41 Log-Likelihood: -447.99
converged: True LL-Null: -1244.5
Covariance Type: nonrobust LLR p-value: 0.000
================================================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------------------------
const -9.5914 16.781 -0.572 0.568 -42.481 23.298
ID -6.751e-05 6.26e-05 -1.079 0.281 -0.000 5.51e-05
Age -0.0269 0.072 -0.373 0.709 -0.168 0.115
Experience 0.0390 0.072 0.542 0.588 -0.102 0.180
Income 0.0617 0.004 17.574 0.000 0.055 0.069
ZIPCode -4.284e-05 0.000 -0.242 0.809 -0.000 0.000
Family 0.7336 0.090 8.122 0.000 0.557 0.911
CCAvg 0.3777 0.067 5.646 0.000 0.247 0.509
Securities_Account -0.8681 0.345 -2.515 0.012 -1.545 -0.192
CD_Account 3.9798 0.398 9.997 0.000 3.200 4.760
Online -0.7646 0.193 -3.969 0.000 -1.142 -0.387
CreditCard -1.2444 0.248 -5.022 0.000 -1.730 -0.759
Education_2 3.8901 0.308 12.611 0.000 3.286 4.495
Education_3 3.9648 0.302 13.144 0.000 3.374 4.556
County_Contra Costa County 0.4712 0.761 0.619 0.536 -1.021 1.963
County_Kern County 1.0435 0.777 1.343 0.179 -0.480 2.567
County_Los Angeles County 0.4190 0.762 0.550 0.583 -1.075 1.913
County_Marin County 0.6331 0.962 0.658 0.510 -1.253 2.519
County_Monterey County 0.6314 0.626 1.009 0.313 -0.595 1.858
County_Orange County 0.2520 0.572 0.441 0.660 -0.869 1.373
County_Other County 0.4415 0.507 0.871 0.384 -0.553 1.436
County_Riverside County 1.5687 0.819 1.916 0.055 -0.036 3.173
County_Sacramento County 0.2812 0.628 0.448 0.654 -0.950 1.512
County_San Bernardino County -0.3703 0.854 -0.434 0.665 -2.044 1.303
County_San Diego County 0.3633 0.591 0.615 0.539 -0.795 1.522
County_San Francisco County 0.3970 0.522 0.761 0.447 -0.625 1.419
County_San Mateo County -1.1638 0.637 -1.826 0.068 -2.413 0.086
County_Santa Barbara County 0.4375 0.675 0.648 0.517 -0.886 1.761
County_Santa Clara County 0.6041 0.386 1.565 0.118 -0.152 1.361
County_Santa Cruz County -0.1973 0.788 -0.250 0.802 -1.742 1.347
County_Ventura County 0.5789 0.748 0.773 0.439 -0.888 2.046
County_Yolo County 0.0763 0.649 0.118 0.906 -1.196 1.349
================================================================================================
Possibly complete quasi-separation: A fraction 0.13 of observations can be
perfectly predicted. This might indicate that there is complete
quasi-separation. In this case some parameters will not be identified.
#Fitting logistic regresaion model
logistic_regression = sm.Logit(y_train, X_train.astype(float))
logit = sm.Logit(y_train, X_train.astype(float))
result = logit.fit()
lg = logistic_regression.fit(disp=False)
print(result.summary())
Optimization terminated successfully.
Current function value: 0.111998
Iterations 9
Logit Regression Results
==============================================================================
Dep. Variable: Personal_Loan No. Observations: 4000
Model: Logit Df Residuals: 3968
Method: MLE Df Model: 31
Date: Mon, 30 Sep 2024 Pseudo R-squ.: 0.6400
Time: 04:16:44 Log-Likelihood: -447.99
converged: True LL-Null: -1244.5
Covariance Type: nonrobust LLR p-value: 0.000
================================================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------------------------
const -9.5914 16.781 -0.572 0.568 -42.481 23.298
ID -6.751e-05 6.26e-05 -1.079 0.281 -0.000 5.51e-05
Age -0.0269 0.072 -0.373 0.709 -0.168 0.115
Experience 0.0390 0.072 0.542 0.588 -0.102 0.180
Income 0.0617 0.004 17.574 0.000 0.055 0.069
ZIPCode -4.284e-05 0.000 -0.242 0.809 -0.000 0.000
Family 0.7336 0.090 8.122 0.000 0.557 0.911
CCAvg 0.3777 0.067 5.646 0.000 0.247 0.509
Securities_Account -0.8681 0.345 -2.515 0.012 -1.545 -0.192
CD_Account 3.9798 0.398 9.997 0.000 3.200 4.760
Online -0.7646 0.193 -3.969 0.000 -1.142 -0.387
CreditCard -1.2444 0.248 -5.022 0.000 -1.730 -0.759
Education_2 3.8901 0.308 12.611 0.000 3.286 4.495
Education_3 3.9648 0.302 13.144 0.000 3.374 4.556
County_Contra Costa County 0.4712 0.761 0.619 0.536 -1.021 1.963
County_Kern County 1.0435 0.777 1.343 0.179 -0.480 2.567
County_Los Angeles County 0.4190 0.762 0.550 0.583 -1.075 1.913
County_Marin County 0.6331 0.962 0.658 0.510 -1.253 2.519
County_Monterey County 0.6314 0.626 1.009 0.313 -0.595 1.858
County_Orange County 0.2520 0.572 0.441 0.660 -0.869 1.373
County_Other County 0.4415 0.507 0.871 0.384 -0.553 1.436
County_Riverside County 1.5687 0.819 1.916 0.055 -0.036 3.173
County_Sacramento County 0.2812 0.628 0.448 0.654 -0.950 1.512
County_San Bernardino County -0.3703 0.854 -0.434 0.665 -2.044 1.303
County_San Diego County 0.3633 0.591 0.615 0.539 -0.795 1.522
County_San Francisco County 0.3970 0.522 0.761 0.447 -0.625 1.419
County_San Mateo County -1.1638 0.637 -1.826 0.068 -2.413 0.086
County_Santa Barbara County 0.4375 0.675 0.648 0.517 -0.886 1.761
County_Santa Clara County 0.6041 0.386 1.565 0.118 -0.152 1.361
County_Santa Cruz County -0.1973 0.788 -0.250 0.802 -1.742 1.347
County_Ventura County 0.5789 0.748 0.773 0.439 -0.888 2.046
County_Yolo County 0.0763 0.649 0.118 0.906 -1.196 1.349
================================================================================================
Possibly complete quasi-separation: A fraction 0.13 of observations can be
perfectly predicted. This might indicate that there is complete
quasi-separation. In this case some parameters will not be identified.
Observations
In this dataset, negative coefficient values indicate that as the corresponding attribute values increase, the probability of borrowing a personal loan decreases. Conversely, positive coefficients suggest that an increase in attribute values leads to a higher probability of borrowing. It is essential to address multicollinearity to obtain reliable coefficients and p-values. The p-value measures the significance of variables, with values below 0.05 indicating a statistically significant relationship that meets the 0.05 significance threshold.
Test for Multicollinearity - variation inflation factor¶
# Convert all columns in X_train to numeric if possible
# If a column cannot be converted, it will be filled with NaN
X_train_numeric = X_train.apply(pd.to_numeric, errors='coerce')
# Drop columns with any non-finite values (NaNs, infinities)
# This is crucial to ensure all data is numeric and finite
X_train_numeric = X_train_numeric.dropna(axis=1)
# Reset index after dropping columns for consistency
X_train_numeric = X_train_numeric.reset_index(drop=True)
# Select only numerical features
numerical_features = X_train_numeric.select_dtypes(include=np.number).columns
X_train_numeric = X_train_numeric[numerical_features]
# Calculate VIF
from statsmodels.stats.outliers_influence import variance_inflation_factor
vif_series = pd.Series(
[variance_inflation_factor(X_train_numeric.values, i) for i in range(X_train_numeric.shape[1])],
index=X_train_numeric.columns,
dtype=float,
).sort_values(ascending=False)
print("Series after feature selection : \n\n{}\n".format(vif_series))
Series after feature selection : const 3256.559374 Experience 84.036119 Age 84.032422 Income 1.731848 CCAvg 1.678973 CD_Account 1.315546 Securities_Account 1.141553 CreditCard 1.103630 Online 1.042650 Family 1.038721 ZIPCode 1.005523 ID 1.002107 dtype: float64
The VIF score of both age and experience are same. This is indicating that these variables are collinear.
Removing Multicollinearity¶
#removing age
# Create a copy of X_train to avoid modifying the original DataFrame
X_train1 = X_train.copy()
#check if 'Age' column exists before dropping
if 'Age' in X_train1.columns:
# Drop the 'Age' column from the copy
X_train1 = X_train.drop("Age", axis=1)
else:
print("'Age' column not found in X_train1.")
# Apply the same numeric preprocessing to X_train1 as you did for X_train
# Convert all columns in X_train1 to numeric if possible
X_train_numeric = X_train1.apply(pd.to_numeric, errors='coerce')
# Drop columns with any non-finite values (NaNs, infinities)
X_train_numeric = X_train_numeric.dropna(axis=1)
X_train_numeric = X_train_numeric.reset_index(drop=True)
# Select only numerical features
numerical_features = X_train_numeric.select_dtypes(include=np.number).columns
X_train_numeric = X_train_numeric[numerical_features]
# Calculate VIF for the modified DataFrames
from statsmodels.stats.outliers_influence import variance_inflation_factor
vif_series_1 = pd.Series(
[variance_inflation_factor(X_train_numeric.values, i) for i in range(X_train_numeric.shape[1])],
index=X_train_numeric.columns,
dtype=float,
).sort_values(ascending=False)
print("Series after feature selection : \n\n{}\n".format(vif_series_1)) # Print vif_series_1 here
Series after feature selection : const 2812.921597 Income 1.717686 CCAvg 1.674714 CD_Account 1.315475 Securities_Account 1.141551 CreditCard 1.103354 Online 1.042649 Family 1.037715 Experience 1.008569 ZIPCode 1.005307 ID 1.002055 dtype: float64
Now, we can proceed with model building since all the variables have vif<5.
import pandas as pd
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
logistic_regression_1 = sm.Logit(y_train, X_train.astype(float))
lg1 = logistic_regression_1.fit(disp=False)
print(lg1.summary())
Logit Regression Results
==============================================================================
Dep. Variable: Personal_Loan No. Observations: 4000
Model: Logit Df Residuals: 3968
Method: MLE Df Model: 31
Date: Mon, 30 Sep 2024 Pseudo R-squ.: 0.6400
Time: 04:16:56 Log-Likelihood: -447.99
converged: True LL-Null: -1244.5
Covariance Type: nonrobust LLR p-value: 0.000
================================================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------------------------
const -9.5914 16.781 -0.572 0.568 -42.481 23.298
ID -6.751e-05 6.26e-05 -1.079 0.281 -0.000 5.51e-05
Age -0.0269 0.072 -0.373 0.709 -0.168 0.115
Experience 0.0390 0.072 0.542 0.588 -0.102 0.180
Income 0.0617 0.004 17.574 0.000 0.055 0.069
ZIPCode -4.284e-05 0.000 -0.242 0.809 -0.000 0.000
Family 0.7336 0.090 8.122 0.000 0.557 0.911
CCAvg 0.3777 0.067 5.646 0.000 0.247 0.509
Securities_Account -0.8681 0.345 -2.515 0.012 -1.545 -0.192
CD_Account 3.9798 0.398 9.997 0.000 3.200 4.760
Online -0.7646 0.193 -3.969 0.000 -1.142 -0.387
CreditCard -1.2444 0.248 -5.022 0.000 -1.730 -0.759
Education_2 3.8901 0.308 12.611 0.000 3.286 4.495
Education_3 3.9648 0.302 13.144 0.000 3.374 4.556
County_Contra Costa County 0.4712 0.761 0.619 0.536 -1.021 1.963
County_Kern County 1.0435 0.777 1.343 0.179 -0.480 2.567
County_Los Angeles County 0.4190 0.762 0.550 0.583 -1.075 1.913
County_Marin County 0.6331 0.962 0.658 0.510 -1.253 2.519
County_Monterey County 0.6314 0.626 1.009 0.313 -0.595 1.858
County_Orange County 0.2520 0.572 0.441 0.660 -0.869 1.373
County_Other County 0.4415 0.507 0.871 0.384 -0.553 1.436
County_Riverside County 1.5687 0.819 1.916 0.055 -0.036 3.173
County_Sacramento County 0.2812 0.628 0.448 0.654 -0.950 1.512
County_San Bernardino County -0.3703 0.854 -0.434 0.665 -2.044 1.303
County_San Diego County 0.3633 0.591 0.615 0.539 -0.795 1.522
County_San Francisco County 0.3970 0.522 0.761 0.447 -0.625 1.419
County_San Mateo County -1.1638 0.637 -1.826 0.068 -2.413 0.086
County_Santa Barbara County 0.4375 0.675 0.648 0.517 -0.886 1.761
County_Santa Clara County 0.6041 0.386 1.565 0.118 -0.152 1.361
County_Santa Cruz County -0.1973 0.788 -0.250 0.802 -1.742 1.347
County_Ventura County 0.5789 0.748 0.773 0.439 -0.888 2.046
County_Yolo County 0.0763 0.649 0.118 0.906 -1.196 1.349
================================================================================================
Possibly complete quasi-separation: A fraction 0.13 of observations can be
perfectly predicted. This might indicate that there is complete
quasi-separation. In this case some parameters will not be identified.
Removing variables with P-value>0.05
import pandas as pd
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
# Assuming X_train is your original DataFrame
X_train2 = X_train.copy() # Create a copy of X_train and assign it to X_train2
# Replace infinite values with NaN
X_train2 = X_train2.replace([np.inf, -np.inf], np.nan)
#Removing Experience
X_train2 = X_train1.drop("Experience", axis=1).replace([np.inf, -np.inf], np.nan)
# Drop rows with any missing values
X_train2 = X_train2.dropna()
# Ensure y_train has the same index as X_train2 after dropping rows
y_train2 = y_train[y_train.index.isin(X_train2.index)]
logistic_regression_2 = sm.Logit(y_train, X_train.astype(float))
lg2 = logistic_regression_1.fit(disp=False)
print(lg2.summary())
Logit Regression Results
==============================================================================
Dep. Variable: Personal_Loan No. Observations: 4000
Model: Logit Df Residuals: 3968
Method: MLE Df Model: 31
Date: Mon, 30 Sep 2024 Pseudo R-squ.: 0.6400
Time: 04:17:02 Log-Likelihood: -447.99
converged: True LL-Null: -1244.5
Covariance Type: nonrobust LLR p-value: 0.000
================================================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------------------------
const -9.5914 16.781 -0.572 0.568 -42.481 23.298
ID -6.751e-05 6.26e-05 -1.079 0.281 -0.000 5.51e-05
Age -0.0269 0.072 -0.373 0.709 -0.168 0.115
Experience 0.0390 0.072 0.542 0.588 -0.102 0.180
Income 0.0617 0.004 17.574 0.000 0.055 0.069
ZIPCode -4.284e-05 0.000 -0.242 0.809 -0.000 0.000
Family 0.7336 0.090 8.122 0.000 0.557 0.911
CCAvg 0.3777 0.067 5.646 0.000 0.247 0.509
Securities_Account -0.8681 0.345 -2.515 0.012 -1.545 -0.192
CD_Account 3.9798 0.398 9.997 0.000 3.200 4.760
Online -0.7646 0.193 -3.969 0.000 -1.142 -0.387
CreditCard -1.2444 0.248 -5.022 0.000 -1.730 -0.759
Education_2 3.8901 0.308 12.611 0.000 3.286 4.495
Education_3 3.9648 0.302 13.144 0.000 3.374 4.556
County_Contra Costa County 0.4712 0.761 0.619 0.536 -1.021 1.963
County_Kern County 1.0435 0.777 1.343 0.179 -0.480 2.567
County_Los Angeles County 0.4190 0.762 0.550 0.583 -1.075 1.913
County_Marin County 0.6331 0.962 0.658 0.510 -1.253 2.519
County_Monterey County 0.6314 0.626 1.009 0.313 -0.595 1.858
County_Orange County 0.2520 0.572 0.441 0.660 -0.869 1.373
County_Other County 0.4415 0.507 0.871 0.384 -0.553 1.436
County_Riverside County 1.5687 0.819 1.916 0.055 -0.036 3.173
County_Sacramento County 0.2812 0.628 0.448 0.654 -0.950 1.512
County_San Bernardino County -0.3703 0.854 -0.434 0.665 -2.044 1.303
County_San Diego County 0.3633 0.591 0.615 0.539 -0.795 1.522
County_San Francisco County 0.3970 0.522 0.761 0.447 -0.625 1.419
County_San Mateo County -1.1638 0.637 -1.826 0.068 -2.413 0.086
County_Santa Barbara County 0.4375 0.675 0.648 0.517 -0.886 1.761
County_Santa Clara County 0.6041 0.386 1.565 0.118 -0.152 1.361
County_Santa Cruz County -0.1973 0.788 -0.250 0.802 -1.742 1.347
County_Ventura County 0.5789 0.748 0.773 0.439 -0.888 2.046
County_Yolo County 0.0763 0.649 0.118 0.906 -1.196 1.349
================================================================================================
Possibly complete quasi-separation: A fraction 0.13 of observations can be
perfectly predicted. This might indicate that there is complete
quasi-separation. In this case some parameters will not be identified.
import statsmodels.api as sm
# Initial list of columns
cols = X_train2.columns.tolist()
# Set an initial max p-value
max_p_value = 1
# Loop to drop features with high p-values
while len(cols) > 0:
# Define the training set with the current columns
X_train_aux = X_train2[cols]
X_train_aux = sm.add_constant(X_train_aux) # Add constant for intercept
# Fit the logistic regression model
model = sm.Logit(y_train2, X_train_aux.astype(float)).fit(disp=False)
# Get p-values and the maximum p-value
p_values = model.pvalues
max_p_value = max(p_values)
# Name of the variable with maximum p-value
feature_with_p_max = p_values.idxmax()
# Check if max p-value is greater than 0.05
if max_p_value > 0.05:
cols.remove(feature_with_p_max) # Drop the feature
else:
break # Exit loop if all features are significant
# Selected features after dropping
selected_features = cols
print(selected_features)
['const', 'Income', 'Family', 'CCAvg', 'Securities_Account', 'CD_Account', 'Online', 'CreditCard', 'Education_2', 'Education_3', 'County_San Mateo County']
import statsmodels.api as sm
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor
X_train3 = X_train2[selected_features]
logistic_regression_3 = sm.Logit(y_train, X_train3.astype(float))
lg3 = logistic_regression_3.fit()
print(lg3.summary())
Optimization terminated successfully.
Current function value: 0.113747
Iterations 9
Logit Regression Results
==============================================================================
Dep. Variable: Personal_Loan No. Observations: 4000
Model: Logit Df Residuals: 3989
Method: MLE Df Model: 10
Date: Mon, 30 Sep 2024 Pseudo R-squ.: 0.6344
Time: 04:17:12 Log-Likelihood: -454.99
converged: True LL-Null: -1244.5
Covariance Type: nonrobust LLR p-value: 0.000
===========================================================================================
coef std err z P>|z| [0.025 0.975]
-------------------------------------------------------------------------------------------
const -13.5391 0.672 -20.144 0.000 -14.856 -12.222
Income 0.0610 0.003 17.920 0.000 0.054 0.068
Family 0.7071 0.089 7.990 0.000 0.534 0.881
CCAvg 0.3526 0.065 5.446 0.000 0.226 0.479
Securities_Account -0.8534 0.335 -2.548 0.011 -1.510 -0.197
CD_Account 3.9343 0.390 10.081 0.000 3.169 4.699
Online -0.7750 0.189 -4.110 0.000 -1.145 -0.405
CreditCard -1.2354 0.244 -5.071 0.000 -1.713 -0.758
Education_2 3.8187 0.301 12.667 0.000 3.228 4.410
Education_3 3.9173 0.292 13.427 0.000 3.345 4.489
County_San Mateo County -1.6013 0.564 -2.838 0.005 -2.707 -0.495
===========================================================================================
Possibly complete quasi-separation: A fraction 0.12 of observations can be
perfectly predicted. This might indicate that there is complete
quasi-separation. In this case some parameters will not be identified.
Observations¶
No feature now have a p-value greater than 0.5.therefore,the features in X_train3 can be considered as the final ones and lg3 as the final model.
The coefficients of all the selected features are positive. Therefore, an increase in these will lead to an increase in the probability of a customer accepting the personal loan.
Converting coefficients to odds¶
#converting coeffficients to odds as they are in terms of log(odd)
odds = np.exp(lg3.params)
odds
#finding the percentage change
percentage_change = odds * 100 - 100
percentage_change
#adding the odds to a dataframe
odds_df = pd.DataFrame(odds, columns=["Odds"])
odds_df["Percentage Change"] = percentage_change
odds_df
| Odds | Percentage Change | |
|---|---|---|
| const | 0.000001 | -99.999868 |
| Income | 1.062851 | 6.285052 |
| Family | 2.028157 | 102.815750 |
| CCAvg | 1.422737 | 42.273701 |
| Securities_Account | 0.425946 | -57.405445 |
| CD_Account | 51.127910 | 5012.791026 |
| Online | 0.460703 | -53.929715 |
| CreditCard | 0.290704 | -70.929572 |
| Education_2 | 45.544678 | 4454.467760 |
| Education_3 | 50.263304 | 4926.330357 |
| County_San Mateo County | 0.201633 | -79.836729 |
Coefficient interpretations¶
- Features such as income,fanily,cc_avg, CD_account, Riversie county will increase the chance of taking personal loan
- Features such as securities_account,online,creditcard,San Mateo county will decrease the chance to take personal loan.
Model Performance Improvement¶
#defining a function to compute different metrics to check performance of a classification model built using sklearn
def model_performance_classification(model, predictors, target):
"""
Function to compute different metrics to check classification model performance
model: classifier
predictors: independent variables
target: dependent variable
"""
# predicting using the independent variables
pred = model.predict(predictors)
acc = accuracy_score(target, pred) # to compute Accuracy
recall = recall_score(target, pred) # to compute Recall
precision = precision_score(target, pred) # to compute Precision
f1 = f1_score(target, pred) # to compute F1-score
# creating a dataframe of metrics
df_perf = pd.DataFrame(
{"Accuracy": acc, "Recall": recall, "Precision": precision, "F1": f1,},
index=[0],
)
return df_perf
def confusion_matrix_sklearn(model, predictors, target):
"""
To plot the confusion_matrix with percentages
model: classifier
predictors: independent variables
target: dependent variable
"""
y_pred = model.predict(predictors)
cm = confusion_matrix(target, y_pred)
labels = np.asarray(
[
["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
for item in cm.flatten()
]
)
plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=labels, fmt="")
plt.ylabel("True label")
plt.xlabel("Predicted label")
#Model0 performance on test set
model0 = DecisionTreeClassifier(random_state=1)
if (y == 1).any(): # Check if any value in y is equal to 1
model0 = DecisionTreeClassifier(random_state=1, class_weight="balanced")
else:
model0 = DecisionTreeClassifier(random_state=1)
X = df.drop(columns=['id', 'personal_loan'])
y = df['personal_loan']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
X_train = X_train[~y_train.isnull()] # Select rows in X_train where y_train is not null
y_train = y_train.dropna()
model0 = DecisionTreeClassifier(random_state=1)
model0.fit(X_train, y_train)
model0
#convert categorical data to numerical data in a dummy variable
X=pd.get_dummies(X, columns=X.filter(items=["age", "experience", "income", "cc_avg", "zipcode", "family", "mortgage", "education"]).columns.tolist(), drop_first=True)
X.head()
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
import pandas as pd
model0 = DecisionTreeClassifier(random_state=1)
model0.fit(X_train, y_train)
print(confusion_matrix(y_train, model0.predict(X_train)))
print(confusion_matrix(y_test, model0.predict(X_test)))
[[3625 0] [ 0 375]] [[887 8] [ 7 98]]
decision_tree_classifier = model0
decision_tree_classifier
model0, X_train, y_train = decision_tree_classifier, X_train, y_train
decision_tree_perf_train = model_performance_classification(
model0, X_train, y_train
)
decision_tree_perf_train
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 1.0 | 1.0 | 1.0 | 1.0 |
#Recall treatment
log_reg_model_test_perf = model_performance_classification(
model0, X_test, y_test
)
log_reg_model_test_perf
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.985 | 0.933333 | 0.924528 | 0.92891 |
def confusion_matrix_sklearn(model, predictors, target):
"""
To plot the confusion_matrix with percentages
model: classifier
predictors: independent variables
target: dependent variable
"""
from sklearn.metrics import confusion_matrix
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
predictors = X_test
target = y_test
model = model0
y_pred = model.predict(predictors)
cm = confusion_matrix(target, y_pred)
# Calculate percentages
percentages = ["{0:.2%}".format(item/cm.sum()) for item in cm.flatten()]
# Create labels array with same shape as cm
labels = np.asarray(percentages).reshape(cm.shape)
plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=labels, fmt="")
plt.ylabel("True label")
plt.xlabel("Predicted label")
plt.show()
y_test = model0.predict(X_test)
print('Y_test.shape')
print(y_test.shape)
print(y_test)
confusion_matrix_sklearn(model0, X_test, y_test)
Y_test.shape (1000,) [0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 1 1 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 1 0 0 0 0 1 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 1 0 1 0 0 0 0 0 0 0 1 0 0 1 1 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 1 1 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 1 0 0 0 0 1 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 1 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
model1 = DecisionTreeClassifier(random_state=42, class_weight="balanced")
model1.fit(X_train, y_train)
DecisionTreeClassifier(class_weight='balanced', random_state=42)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
DecisionTreeClassifier(class_weight='balanced', random_state=42)
model1 = DecisionTreeClassifier(random_state=42, class_weight="balanced")
model1.fit(X_train, y_train)
#calculate the pruning path
from sklearn.tree import _tree
cost_complexity_pruning_path = model1.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas, impurities = cost_complexity_pruning_path.ccp_alphas, cost_complexity_pruning_path.impurities
ccp_alphas, impurities
(array([0.00000000e+00, 4.90029473e-19, 2.78704263e-18, 1.06581410e-17,
2.62012634e-17, 2.06057393e-16, 1.35135135e-04, 1.37221269e-04,
1.37845646e-04, 2.50000000e-04, 2.53995931e-04, 2.56091546e-04,
2.56880734e-04, 2.62295082e-04, 2.65842784e-04, 2.70270270e-04,
2.75113122e-04, 3.34314376e-04, 4.59443075e-04, 5.35885167e-04,
6.37304276e-04, 7.15822856e-04, 7.50000000e-04, 7.65231298e-04,
8.33595329e-04, 8.38147199e-04, 1.06035930e-03, 1.15076259e-03,
1.17711725e-03, 1.18558824e-03, 1.37136825e-03, 1.46056543e-03,
1.68674699e-03, 1.79020979e-03, 1.82608696e-03, 2.61000000e-03,
2.75098565e-03, 2.85978730e-03, 3.36153030e-03, 3.61896386e-03,
4.01666137e-03, 4.89426453e-03, 4.91632511e-03, 3.09771894e-02,
3.83996398e-02, 2.79838964e-01]),
array([-7.89948949e-15, -7.89899946e-15, -7.89621242e-15, -7.88555428e-15,
-7.85935302e-15, -7.65329562e-15, 2.70270270e-04, 5.44712809e-04,
8.20404101e-04, 1.07040410e-03, 1.83239189e-03, 3.36894117e-03,
4.65334484e-03, 4.91563992e-03, 6.51069662e-03, 6.78096689e-03,
7.33119314e-03, 1.00057081e-02, 1.13840374e-02, 1.19199225e-02,
1.25572268e-02, 1.39888725e-02, 1.47388725e-02, 1.55041038e-02,
1.63376992e-02, 1.80139936e-02, 2.01347122e-02, 2.12854747e-02,
2.24625920e-02, 2.48337685e-02, 2.62051367e-02, 2.91262676e-02,
3.08130146e-02, 3.26032244e-02, 3.44293113e-02, 3.70393113e-02,
4.25412826e-02, 4.82608572e-02, 5.16223875e-02, 5.52413514e-02,
5.92580127e-02, 6.41522773e-02, 7.39849275e-02, 1.04962117e-01,
2.20161036e-01, 5.00000000e-01]))
#convert numerical data to categorical data
X=pd.get_dummies(X, columns=X.filter(items=["age", "experience", "income", "cc_avg", "zipcode", "family", "mortgage", "education"]).columns.tolist(), drop_first=True)
confusion_matrix_sklearn(model1, X_train, y_train)
confusion_matrix_sklearn(model1, X_test, y_test)
decision_tree_perf_train = model_performance_classification(
model1, X_train, y_train
)
decision_tree_perf_train
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 1.0 | 1.0 | 1.0 | 1.0 |
confusion_matrix_sklearn(model1, X_test, y_test)
Observations
- recall of model has increased but the other metric have reduced.
- The model is still performing well.
decision_tree_perf_test = model_performance_classification(
model1, X_test, y_test
)
decision_tree_perf_test
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.982 | 0.877358 | 0.94898 | 0.911765 |
#Recall treatment
log_reg_model_test_perf = model_performance_classification(
model1, X_test, y_test
)
log_reg_model_test_perf
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.982 | 0.877358 | 0.94898 | 0.911765 |
# Define the parameters of the tree to iterate over
max_depth_values = np.arange(2, 7, 2)
max_leaf_nodes_values = [50, 75, 150, 250]
min_samples_split_values = [10, 30, 50, 70]
# Initialize variables to store the best model and its performance
best_estimator = None
best_score_diff = float('inf')
best_test_score = 0.0
# Iterate over all combinations of the specified parameter values
for max_depth in max_depth_values:
for max_leaf_nodes in max_leaf_nodes_values:
for min_samples_split in min_samples_split_values:
# Create a decision tree classifier with the current parameter values
estimator = DecisionTreeClassifier(
max_depth=max_depth,
max_leaf_nodes=max_leaf_nodes,
min_samples_split=min_samples_split,
class_weight='balanced',
random_state=42
)
# Fit the model to the training data
estimator.fit(X_train, y_train)
# Make predictions on the training and test sets
y_train_pred = estimator.predict(X_train)
y_test_pred = estimator.predict(X_test)
# creating an instance of the best model
model2 = DecisionTreeClassifier(
max_depth=max_depth,
max_leaf_nodes=max_leaf_nodes,
min_samples_split=min_samples_split,
class_weight='balanced',
random_state=42
)
model2 = best_estimator
# fitting the best model to the training data
from sklearn.tree import DecisionTreeClassifier
model2 = DecisionTreeClassifier(random_state=42)
model2.fit(X_train, y_train)
model2 # printing the best model
DecisionTreeClassifier(random_state=42)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
DecisionTreeClassifier(random_state=42)
decision_tree_tune_perf_test = model_performance_classification(
model2, X_test, y_test
)
decision_tree_tune_perf_test
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.997 | 0.981132 | 0.990476 | 0.985782 |
decision_tree_tune_perf_train = model_performance_classification(
model2, X_train, y_train
)
decision_tree_tune_perf_train
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 1.0 | 1.0 | 1.0 | 1.0 |
confusion_matrix_sklearn(model2, X_test, y_test)
Model performance on training set
decision_tree_tune_perf_test = model_performance_classification(
model2, X_test, y_test
)
decision_tree_tune_perf_test
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.997 | 0.981132 | 0.990476 | 0.985782 |
Model performance on test set
# recall treatment
log_reg_model_test_perf = model_performance_classification(
model2, X_test, y_test
)
log_reg_model_test_perf
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.997 | 0.981132 | 0.990476 | 0.985782 |
conclusion
- The values are comparable across all three logistic regression models.
Model Building - decision Tree¶
#Viewing the dataset data_dtree that was created prior to the outlier treatment
df_dtree.head()
| ID | Age | Experience | Income | ZIPCode | Family | CCAvg | Education | Mortgage | Personal_Loan | Securities_Account | CD_Account | Online | CreditCard | County | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 25 | 1 | 49 | 91107 | 4 | 1.6 | 1 | No | 0 | 1 | 0 | 0 | 0 | Los Angeles County |
| 1 | 2 | 45 | 19 | 34 | 90089 | 3 | 1.5 | 1 | No | 0 | 1 | 0 | 0 | 0 | Los Angeles County |
| 2 | 3 | 39 | 15 | 11 | 94720 | 1 | 1.0 | 1 | No | 0 | 0 | 0 | 0 | 0 | Alameda County |
| 3 | 4 | 35 | 9 | 100 | 94112 | 1 | 2.7 | 2 | No | 0 | 0 | 0 | 0 | 0 | San Francisco County |
| 4 | 5 | 35 | 8 | 45 | 91330 | 4 | 1.0 | 2 | No | 0 | 0 | 0 | 0 | 1 | Los Angeles County |
X = df_dtree.drop(['Personal_Loan'], axis=1) #Creating independent variables
y = df_dtree["Personal_Loan"]
#Creating training and test sets
X_train, X_test, y_train, y_test = train_test_split(
df_dtree.drop("Personal_Loan", axis=1),
df_dtree["Personal_Loan"],
test_size=0.20,
random_state=42,
)
#creating dummy variables for categorical features
X_train = pd.get_dummies(X_train)
X_test = pd.get_dummies(X_test)
print("Shape of Training set : ", X_train.shape)
print("Shape of test set : ", X_test.shape)
print("\n Percentage of classes in training set : ")
print(y_train.value_counts(normalize=True))
print("\n Percentage of classes in test set : ")
print(y_test.value_counts(normalize=True))
Shape of Training set : (4000, 33) Shape of test set : (1000, 33) Percentage of classes in training set : Personal_Loan 0 0.90625 1 0.09375 Name: proportion, dtype: float64 Percentage of classes in test set : Personal_Loan 0 0.895 1 0.105 Name: proportion, dtype: float64
We have well balanced classes of thh depedent variable in both train and test sets.
Model evaluation criterion¶
Model can make wrong prediction as : Predicting a customer will borrow the personal loan but in reality the customer does not borrow. Predicting a customer will not borrow the personal loan but in reality the customer will borrow the laon.
Prediction of concern : The second prediction is our major concern as the AllLife Bank is looking to expand it's loan business, every single customer who will borrow the personal loan is very important. Hence, mistakes in the second prediction (i.e. False Negatives) has to be considerably low How to reduce False Negatives? By recalling score should be maximized. Greater the Recall score higher the chances of predicting the potential customers who will borrow a personal loan
def model_performance_classification(model, predictors, target):
"""
Function to compute different metrics to check classification model performance
model: classifier
predictors: independent variables
target: dependent variable
"""
# predicting using the independent variables
pred = model.predict(predictors)
acc = accuracy_score(target, pred) # to compute Accuracy
recall = recall_score(target, pred) # to compute Recall
precision = precision_score(target, pred) # to compute Precision
f1 = f1_score(target, pred) # to compute F1-score
# creating a dataframe of metrics
df_perf = pd.DataFrame(
{"Accuracy": acc, "Recall": recall, "Precision": precision, "F1": f1,},
index=[0],
)
return df_perf
#defining a function to plot the oncfusion_matrix of a classification model
def model_performance_classification(model, predictors, target):
"""
Function to compute different metrics to check classification model performance
model: classifier
predictors: independent variables
target: dependent variable
"""
# predicting using the independent variables
pred = model.predict(predictors)
acc = accuracy_score(target, pred) # to compute Accuracy
recall = recall_score(target, pred) # to compute Recall
precision = precision_score(target, pred) # to compute Precision
f1 = f1_score(target, pred) # to compute F1-score
# creating a dataframe of metrics
df_perf = pd.DataFrame(
{"Accuracy": acc, "Recall": recall, "Precision": precision, "F1": f1,},
index=[0],
)
return df_perf
# Building a Decision Tree using default 'gini' criteria to split
model = DecisionTreeClassifier(
criterion="gini", class_weight={0: 0.15, 1: 0.85}, random_state=42
)
model.fit(X_train, y_train)
# defining higher weightage to 1 in the dependent variable
DecisionTreeClassifier(class_weight={0: 0.15, 1: 0.85}, random_state=42)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook. On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
DecisionTreeClassifier(class_weight={0: 0.15, 1: 0.85}, random_state=42)# Fitting the model on the training set
model.fit(X_train, y_train)
DecisionTreeClassifier(class_weight={0: 0.15, 1: 0.85}, random_state=42)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook. On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
DecisionTreeClassifier(class_weight={0: 0.15, 1: 0.85}, random_state=42)Model Performance on training set
confusion_matrix(y_train, model.predict(X_train))
import statsmodels.api as sm
def confision_matrix_sklearn(model, predictors, target):
"""
To plot the confusion_matrix with percentages
model: classifier
predictors: independent variables
target: dependent variable
"""
y_pred = model.predict(predictors)
cm = confusion_matrix(target, y_pred)
labels = np.asarray(
[
["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
for item in cm.flatten()
]
)
plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=labels, fmt="")
plt.ylabel("True label")
plt.xlabel("Predicted label")
plt.show()
return
confision_matrix_sklearn(model, X_train, y_train)
decision_tree_perf_train = model_performance_classification(model, X_train, y_train)
decision_tree_perf_train
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 1.0 | 1.0 | 1.0 | 1.0 |
Observation The deicison tree is fully grown.Tehrefore the model is overfit and it is able to classify all the data poits on the training set with no errors.
Model Performance on test set
def confusion_matrix_sklearn(model, predictors, target):
"""
To plot the confusion_matrix with percentages
model: classifier
predictors: independent variables
target: dependent variable
"""
y_pred = model.predict(predictors)
cm = confusion_matrix(target, y_pred)
labels = np.asarray(
[
["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
for item in cm.flatten()
]
)
plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=labels, fmt="")
plt.ylabel("True label")
plt.xlabel("Predicted label")
plt.show()
return
confision
matrix_sklearn(model, X_test, y_test)
return
decision_tree_performance_test = model_performance_classification(model, X_test, y_test) # Replace y_est with y_test
decision_tree_performance_test
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.984 | 0.92381 | 0.92381 | 0.92381 |
Found disparity in model performance is due to overfitting
plt.figure(figsize=(12, 8))
# Assuming X_train is a pandas DataFrame containing your training data
# and you want to use its column names as feature names
feature_names = X_train.columns # Define feature_names here
out = tree.plot_tree(
model,
feature_names=feature_names, # Now feature_names is defined
filled=True,
fontsize=9,
node_ids=False,
class_names=None,
)
# below code will add arrows to the decision tree split if they are missing
for o in out:
arrow = o.arrow_patch
if arrow is not None:
arrow.set_edgecolor("black")
arrow.set_linewidth(1)
plt.show()
# Text report showing the rules of a decision tree
print(tree.export_text(model, feature_names=feature_names, show_weights=True))
|--- Income <= 98.50 | |--- CCAvg <= 2.95 | | |--- weights: [424.80, 0.00] class: 0 | |--- CCAvg > 2.95 | | |--- CCAvg <= 4.25 | | | |--- Income <= 82.50 | | | | |--- Experience <= 8.50 | | | | | |--- Family <= 3.50 | | | | | | |--- weights: [0.00, 4.25] class: 1 | | | | | |--- Family > 3.50 | | | | | | |--- weights: [1.05, 0.00] class: 0 | | | | |--- Experience > 8.50 | | | | | |--- ZIPCode <= 94657.50 | | | | | | |--- ZIPCode <= 91257.00 | | | | | | | |--- ID <= 1184.50 | | | | | | | | |--- weights: [0.00, 0.85] class: 1 | | | | | | | |--- ID > 1184.50 | | | | | | | | |--- weights: [1.20, 0.00] class: 0 | | | | | | |--- ZIPCode > 91257.00 | | | | | | | |--- weights: [6.75, 0.00] class: 0 | | | | | |--- ZIPCode > 94657.50 | | | | | | |--- Income <= 63.50 | | | | | | | |--- weights: [0.90, 0.00] class: 0 | | | | | | |--- Income > 63.50 | | | | | | | |--- Family <= 2.50 | | | | | | | | |--- weights: [0.45, 0.00] class: 0 | | | | | | | |--- Family > 2.50 | | | | | | | | |--- Age <= 54.50 | | | | | | | | | |--- County_Alameda County <= 0.50 | | | | | | | | | | |--- weights: [0.00, 0.85] class: 1 | | | | | | | | | |--- County_Alameda County > 0.50 | | | | | | | | | | |--- weights: [0.00, 0.85] class: 1 | | | | | | | | |--- Age > 54.50 | | | | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | |--- Income > 82.50 | | | | |--- ID <= 934.50 | | | | | |--- County_Santa Clara County <= 0.50 | | | | | | |--- weights: [1.65, 0.00] class: 0 | | | | | |--- County_Santa Clara County > 0.50 | | | | | | |--- weights: [0.00, 0.85] class: 1 | | | | |--- ID > 934.50 | | | | | |--- County_Santa Clara County <= 0.50 | | | | | | |--- County_Contra Costa County <= 0.50 | | | | | | | |--- County_Yolo County <= 0.50 | | | | | | | | |--- County_Orange County <= 0.50 | | | | | | | | | |--- County_Santa Barbara County <= 0.50 | | | | | | | | | | |--- County_Marin County <= 0.50 | | | | | | | | | | | |--- truncated branch of depth 8 | | | | | | | | | | |--- County_Marin County > 0.50 | | | | | | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | | | | | | |--- County_Santa Barbara County > 0.50 | | | | | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | | | | | |--- County_Orange County > 0.50 | | | | | | | | | |--- weights: [0.30, 0.00] class: 0 | | | | | | | |--- County_Yolo County > 0.50 | | | | | | | | |--- weights: [0.30, 0.00] class: 0 | | | | | | |--- County_Contra Costa County > 0.50 | | | | | | | |--- weights: [0.30, 0.00] class: 0 | | | | | |--- County_Santa Clara County > 0.50 | | | | | | |--- weights: [0.45, 0.00] class: 0 | | |--- CCAvg > 4.25 | | | |--- weights: [5.85, 0.00] class: 0 |--- Income > 98.50 | |--- Education <= 1.50 | | |--- Family <= 2.50 | | | |--- Income <= 99.50 | | | | |--- CCAvg <= 4.20 | | | | | |--- weights: [0.60, 0.00] class: 0 | | | | |--- CCAvg > 4.20 | | | | | |--- ZIPCode <= 90139.50 | | | | | | |--- weights: [0.00, 0.85] class: 1 | | | | | |--- ZIPCode > 90139.50 | | | | | | |--- weights: [0.00, 0.85] class: 1 | | | |--- Income > 99.50 | | | | |--- Income <= 103.50 | | | | | |--- Securities_Account <= 0.50 | | | | | | |--- weights: [2.10, 0.00] class: 0 | | | | | |--- Securities_Account > 0.50 | | | | | | |--- weights: [0.00, 0.85] class: 1 | | | | |--- Income > 103.50 | | | | | |--- County_Santa Clara County <= 0.50 | | | | | | |--- weights: [69.45, 0.00] class: 0 | | | | | |--- County_Santa Clara County > 0.50 | | | | | | |--- weights: [8.40, 0.00] class: 0 | | |--- Family > 2.50 | | | |--- Income <= 113.50 | | | | |--- Family <= 3.50 | | | | | |--- County_Orange County <= 0.50 | | | | | | |--- weights: [2.70, 0.00] class: 0 | | | | | |--- County_Orange County > 0.50 | | | | | | |--- weights: [0.00, 0.85] class: 1 | | | | |--- Family > 3.50 | | | | | |--- CCAvg <= 4.15 | | | | | | |--- County_Los Angeles County <= 0.50 | | | | | | | |--- weights: [0.00, 1.70] class: 1 | | | | | | |--- County_Los Angeles County > 0.50 | | | | | | | |--- weights: [0.00, 1.70] class: 1 | | | | | |--- CCAvg > 4.15 | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | |--- Income > 113.50 | | | | |--- weights: [0.00, 50.15] class: 1 | |--- Education > 1.50 | | |--- Income <= 114.50 | | | |--- CCAvg <= 2.45 | | | | |--- Income <= 106.50 | | | | | |--- weights: [5.10, 0.00] class: 0 | | | | |--- Income > 106.50 | | | | | |--- Experience <= 31.50 | | | | | | |--- Experience <= 3.50 | | | | | | | |--- weights: [1.35, 0.00] class: 0 | | | | | | |--- Experience > 3.50 | | | | | | | |--- County_Santa Clara County <= 0.50 | | | | | | | | |--- Online <= 0.50 | | | | | | | | | |--- Family <= 3.00 | | | | | | | | | | |--- weights: [1.05, 0.00] class: 0 | | | | | | | | | |--- Family > 3.00 | | | | | | | | | | |--- weights: [0.00, 0.85] class: 1 | | | | | | | | |--- Online > 0.50 | | | | | | | | | |--- CreditCard <= 0.50 | | | | | | | | | | |--- ZIPCode <= 95272.00 | | | | | | | | | | | |--- truncated branch of depth 3 | | | | | | | | | | |--- ZIPCode > 95272.00 | | | | | | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | | | | | | |--- CreditCard > 0.50 | | | | | | | | | | |--- Experience <= 13.50 | | | | | | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | | | | | | | |--- Experience > 13.50 | | | | | | | | | | | |--- weights: [0.30, 0.00] class: 0 | | | | | | | |--- County_Santa Clara County > 0.50 | | | | | | | | |--- Experience <= 5.50 | | | | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | | | | | |--- Experience > 5.50 | | | | | | | | | |--- weights: [0.60, 0.00] class: 0 | | | | | |--- Experience > 31.50 | | | | | | |--- weights: [1.80, 0.00] class: 0 | | | |--- CCAvg > 2.45 | | | | |--- ID <= 2303.00 | | | | | |--- ID <= 163.00 | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | | |--- ID > 163.00 | | | | | | |--- County_Other County <= 0.50 | | | | | | | |--- weights: [0.00, 12.75] class: 1 | | | | | | |--- County_Other County > 0.50 | | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | |--- ID > 2303.00 | | | | | |--- Experience <= 35.50 | | | | | | |--- Age <= 34.50 | | | | | | | |--- Experience <= 2.50 | | | | | | | | |--- weights: [0.00, 0.85] class: 1 | | | | | | | |--- Experience > 2.50 | | | | | | | | |--- weights: [0.90, 0.00] class: 0 | | | | | | |--- Age > 34.50 | | | | | | | |--- County_Santa Clara County <= 0.50 | | | | | | | | |--- Income <= 100.00 | | | | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | | | | | |--- Income > 100.00 | | | | | | | | | |--- ZIPCode <= 90154.50 | | | | | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | | | | | | |--- ZIPCode > 90154.50 | | | | | | | | | | |--- CreditCard <= 0.50 | | | | | | | | | | | |--- weights: [0.00, 5.95] class: 1 | | | | | | | | | | |--- CreditCard > 0.50 | | | | | | | | | | | |--- truncated branch of depth 2 | | | | | | | |--- County_Santa Clara County > 0.50 | | | | | | | | |--- Experience <= 11.50 | | | | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | | | | | |--- Experience > 11.50 | | | | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | | |--- Experience > 35.50 | | | | | | |--- ZIPCode <= 94275.50 | | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | | | |--- ZIPCode > 94275.50 | | | | | | | |--- weights: [0.45, 0.00] class: 0 | | |--- Income > 114.50 | | | |--- Income <= 116.50 | | | | |--- Mortgage_Yes <= 0.50 | | | | | |--- ID <= 304.00 | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | | |--- ID > 304.00 | | | | | | |--- ZIPCode <= 90060.00 | | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | | | |--- ZIPCode > 90060.00 | | | | | | | |--- Experience <= 32.00 | | | | | | | | |--- weights: [0.00, 6.80] class: 1 | | | | | | | |--- Experience > 32.00 | | | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | |--- Mortgage_Yes > 0.50 | | | | | |--- Age <= 32.00 | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | | |--- Age > 32.00 | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | |--- Income > 116.50 | | | | |--- weights: [0.00, 201.45] class: 1
The tree above is fully grown hence,complex and difficult to interpret.
#Gini importance of features in the tree building
print(
pd.DataFrame(
model.feature_importances_, columns=["Imp"], index=X_train.columns
).sort_values(by="Imp", ascending=False)
)
Imp Income 5.619161e-01 Family 1.563385e-01 Education 1.291092e-01 CCAvg 8.804473e-02 Experience 1.766618e-02 ID 1.029078e-02 County_Santa Clara County 7.854868e-03 ZIPCode 7.288236e-03 County_Orange County 4.430490e-03 Age 3.518669e-03 Securities_Account 3.011105e-03 CreditCard 2.181233e-03 Online 2.074456e-03 Mortgage_Yes 1.261141e-03 County_Yolo County 1.177442e-03 County_Contra Costa County 1.143064e-03 County_Other County 7.377706e-04 County_Sacramento County 7.049808e-04 County_Marin County 6.303210e-04 County_Santa Barbara County 6.206978e-04 County_Los Angeles County 2.817669e-18 County_Alameda County 1.408835e-18 County_San Mateo County 0.000000e+00 County_Ventura County 0.000000e+00 County_Santa Cruz County 0.000000e+00 CD_Account 0.000000e+00 County_San Bernardino County 0.000000e+00 County_San Francisco County 0.000000e+00 County_San Diego County 0.000000e+00 County_Riverside County 0.000000e+00 Mortgage_No 0.000000e+00 County_Monterey County 0.000000e+00 County_Kern County 0.000000e+00
importances = model.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12, 10))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
- According to the decision tree model, income if the most important variable for predicting if a customer will take a personal Loan.
Model performance improvement Gridsearch for Hyperparameter tuning to reduce overfitting
from sklearn.metrics import make_scorer, recall_score # Import make_scorer and recall_score
#choose the type of classifier
estimator = DecisionTreeClassifier(random_state=42)
#model = DecisionTreeClassifier(random_state=42
#grid of patameter to choose from
parameters = {
"max_depth": [5, 10, 15, None],
"criterion": ["entropy", "gini"],
"splitter": ["best", "random"],
"min_impurity_decrease": [0.00001, 0.0001, 0.01],
}
# Type of scoring used to compare parameter combinations
scorer = make_scorer(recall_score)
# Run the grid search
import warnings
warnings.filterwarnings("ignore")
from sklearn.model_selection import GridSearchCV
grid_object = GridSearchCV(estimator, parameters, scoring=scorer, cv=5)
df_grid = grid_object.fit(X_train, y_train)
# The error was in the next line, grid_obj was not defined,
# it should be grid_object
grid_object = grid_object.fit(X_train, y_train)
# Set the clf to the best combination of parameters
estimator = grid_object.best_estimator_
# Fit the best algorithm to the data
estimator.fit(X_train, y_train)
DecisionTreeClassifier(class_weight={0: 0.85}, max_depth=5,
min_impurity_decrease=1e-05, random_state=42)
DecisionTreeClassifier(class_weight={0: 0.85}, max_depth=5,
min_impurity_decrease=1e-05, random_state=42)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook. On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
DecisionTreeClassifier(class_weight={0: 0.85}, max_depth=5,
min_impurity_decrease=1e-05, random_state=42)Model performance on training set
decision_tree_tune_performance_train = model_performance_classification(
estimator, X_train, y_train
)
decision_tree_tune_perf_train
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 1.0 | 1.0 | 1.0 | 1.0 |
def confusion_matrix_sklearn(model, predictors, target):
y_pred = model.predict(predictors)
cm = confusion_matrix(target, y_pred)
labels = np
np.asarray(
[
["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
for item in cm.flatten()
]
)
plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=labels, fmt="")
plt.ylabel("True label")
plt.xlabel("Predicted label")
plt.show()
return
confusion_matrix_sklearn(estimator, X_train, y_train)
# Model performance on test set
decision_tree_tune_perf_test = model_performance_classification(
estimator, X_test, y_test
)
decision_tree_tune_perf_test
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.986 | 0.904762 | 0.959596 | 0.931373 |
from sklearn.metrics import make_scorer, recall_score # Import make_scorer and recall_score
#choose the type of classifier
estimator = DecisionTreeClassifier(random_state=42)
#model = DecisionTreeClassifier(random_state=42
#grid of patameter to choose from
parameters = {
"max_depth": [5, 10, 15, None],
"criterion": ["entropy", "gini"],
"splitter": ["best", "random"],
"min_impurity_decrease": [0.00001, 0.0001, 0.01],
}
# Type of scoring used to compare parameter combinations
scorer = make_scorer(recall_score)
# Run the grid search
import warnings
warnings.filterwarnings("ignore")
from sklearn.model_selection import GridSearchCV
grid_obj = GridSearchCV(estimator, parameters, scoring=scorer, cv=5)
df_grid = grid_obj.fit(X_train, y_train)
# Access the best estimator
best_estimator = grid_obj.best_estimator_
def confusion_matrix_sklearn(model, predictors, target):
"""
Generates and displays a confusion matrix using Seaborn's heatmap.
Args:
model: The trained classification model.
predictors: The predictor variables (X).
target: The target variable (y).
"""
y_pred = model.predict(predictors)
cm = confusion_matrix(target, y_pred)
# Reshape labels to match cm shape
labels = np.asarray(
[
["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
for item in cm.flatten()
]
).reshape(cm.shape) # Reshape labels to match cm's shape
plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=labels, fmt="")
plt.ylabel("True label")
plt.xlabel("Predicted label")
plt.show()
return
confusion_matrix_sklearn(best_estimator, X_test, y_test)
Visualizing the decision tree¶
plt.figure(figsize=(12, 8))
# Pass the best_estimator instead of estimator
out = tree.plot_tree(
best_estimator, # Changed to best_estimator
feature_names=feature_names,
filled=True,
fontsize=9,
node_ids=False,
class_names=None,
)
# below code will add arrows to the decision tree split if they are missing
for o in out:
arrow = o.arrow_patch
if arrow is not None:
arrow.set_edgecolor("black")
arrow.set_linewidth(1)
plt.show()
# Test report showing the rules of a decision tree
print(tree.export_text(best_estimator, feature_names=feature_names, show_weights=True))
|--- Income <= 113.50 | |--- CCAvg <= 2.95 | | |--- Income <= 106.50 | | | |--- weights: [2892.00, 0.00] class: 0 | | |--- Income > 106.50 | | | |--- County_Santa Barbara County <= 0.50 | | | | |--- Education <= 1.50 | | | | | |--- Family <= 3.50 | | | | | | |--- weights: [35.00, 0.00] class: 0 | | | | | |--- Family > 3.50 | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | |--- Education > 1.50 | | | | | |--- County_Other County <= 0.50 | | | | | | |--- County_Ventura County <= 0.50 | | | | | | | |--- CCAvg <= 2.45 | | | | | | | | |--- ID <= 1276.50 | | | | | | | | | |--- weights: [12.00, 0.00] class: 0 | | | | | | | | |--- ID > 1276.50 | | | | | | | | | |--- ZIPCode <= 90069.00 | | | | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | | | | | |--- ZIPCode > 90069.00 | | | | | | | | | | |--- weights: [20.00, 4.00] class: 0 | | | | | | | |--- CCAvg > 2.45 | | | | | | | | |--- Family <= 1.50 | | | | | | | | | |--- weights: [2.00, 0.00] class: 0 | | | | | | | | |--- Family > 1.50 | | | | | | | | | |--- weights: [0.00, 2.00] class: 1 | | | | | | |--- County_Ventura County > 0.50 | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | |--- County_Other County > 0.50 | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | |--- County_Santa Barbara County > 0.50 | | | | |--- Mortgage_Yes <= 0.50 | | | | | |--- weights: [1.00, 0.00] class: 0 | | | | |--- Mortgage_Yes > 0.50 | | | | | |--- weights: [0.00, 2.00] class: 1 | |--- CCAvg > 2.95 | | |--- CD_Account <= 0.50 | | | |--- Income <= 82.50 | | | | |--- Age <= 28.00 | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | |--- Age > 28.00 | | | | | |--- Experience <= 8.50 | | | | | | |--- Family <= 3.00 | | | | | | | |--- weights: [0.00, 2.00] class: 1 | | | | | | |--- Family > 3.00 | | | | | | | |--- weights: [7.00, 0.00] class: 0 | | | | | |--- Experience > 8.50 | | | | | | |--- ZIPCode <= 94657.50 | | | | | | | |--- ID <= 1071.50 | | | | | | | | |--- ID <= 1030.50 | | | | | | | | | |--- weights: [12.00, 0.00] class: 0 | | | | | | | | |--- ID > 1030.50 | | | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | | | |--- ID > 1071.50 | | | | | | | | |--- weights: [50.00, 0.00] class: 0 | | | | | | |--- ZIPCode > 94657.50 | | | | | | | |--- ZIPCode <= 94714.50 | | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | | | |--- ZIPCode > 94714.50 | | | | | | | | |--- CCAvg <= 3.10 | | | | | | | | | |--- Education <= 2.50 | | | | | | | | | | |--- weights: [3.00, 0.00] class: 0 | | | | | | | | | |--- Education > 2.50 | | | | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | | | | |--- CCAvg > 3.10 | | | | | | | | | |--- weights: [13.00, 0.00] class: 0 | | | |--- Income > 82.50 | | | | |--- Family <= 2.50 | | | | | |--- Experience <= 33.50 | | | | | | |--- County_Riverside County <= 0.50 | | | | | | | |--- Experience <= 3.50 | | | | | | | | |--- ID <= 2248.00 | | | | | | | | | |--- weights: [2.00, 0.00] class: 0 | | | | | | | | |--- ID > 2248.00 | | | | | | | | | |--- weights: [0.00, 3.00] class: 1 | | | | | | | |--- Experience > 3.50 | | | | | | | | |--- County_Sacramento County <= 0.50 | | | | | | | | | |--- Income <= 110.50 | | | | | | | | | | |--- weights: [63.00, 1.00] class: 0 | | | | | | | | | |--- Income > 110.50 | | | | | | | | | | |--- weights: [8.00, 2.00] class: 0 | | | | | | | | |--- County_Sacramento County > 0.50 | | | | | | | | | |--- Family <= 1.50 | | | | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | | | | | |--- Family > 1.50 | | | | | | | | | | |--- weights: [1.00, 0.00] class: 0 | | | | | | |--- County_Riverside County > 0.50 | | | | | | | |--- weights: [0.00, 2.00] class: 1 | | | | | |--- Experience > 33.50 | | | | | | |--- Education <= 1.50 | | | | | | | |--- ZIPCode <= 92321.50 | | | | | | | | |--- ID <= 1537.50 | | | | | | | | | |--- weights: [0.00, 2.00] class: 1 | | | | | | | | |--- ID > 1537.50 | | | | | | | | | |--- weights: [1.00, 0.00] class: 0 | | | | | | | |--- ZIPCode > 92321.50 | | | | | | | | |--- weights: [5.00, 0.00] class: 0 | | | | | | |--- Education > 1.50 | | | | | | | |--- weights: [0.00, 4.00] class: 1 | | | | |--- Family > 2.50 | | | | | |--- Age <= 57.00 | | | | | | |--- Online <= 0.50 | | | | | | | |--- weights: [0.00, 13.00] class: 1 | | | | | | |--- Online > 0.50 | | | | | | | |--- ID <= 2630.00 | | | | | | | | |--- Income <= 102.00 | | | | | | | | | |--- weights: [7.00, 0.00] class: 0 | | | | | | | | |--- Income > 102.00 | | | | | | | | | |--- weights: [0.00, 2.00] class: 1 | | | | | | | |--- ID > 2630.00 | | | | | | | | |--- weights: [0.00, 4.00] class: 1 | | | | | |--- Age > 57.00 | | | | | | |--- CCAvg <= 3.20 | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | | |--- CCAvg > 3.20 | | | | | | | |--- ZIPCode <= 91383.00 | | | | | | | | |--- Experience <= 34.00 | | | | | | | | | |--- weights: [1.00, 0.00] class: 0 | | | | | | | | |--- Experience > 34.00 | | | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | | | |--- ZIPCode > 91383.00 | | | | | | | | |--- weights: [8.00, 0.00] class: 0 | | |--- CD_Account > 0.50 | | | |--- CCAvg <= 3.85 | | | | |--- weights: [0.00, 8.00] class: 1 | | | |--- CCAvg > 3.85 | | | | |--- CCAvg <= 4.05 | | | | | |--- weights: [3.00, 0.00] class: 0 | | | | |--- CCAvg > 4.05 | | | | | |--- ID <= 2505.50 | | | | | | |--- CreditCard <= 0.50 | | | | | | | |--- ID <= 2348.00 | | | | | | | | |--- weights: [0.00, 2.00] class: 1 | | | | | | | |--- ID > 2348.00 | | | | | | | | |--- weights: [1.00, 0.00] class: 0 | | | | | | |--- CreditCard > 0.50 | | | | | | | |--- weights: [2.00, 0.00] class: 0 | | | | | |--- ID > 2505.50 | | | | | | |--- weights: [0.00, 4.00] class: 1 |--- Income > 113.50 | |--- Education <= 1.50 | | |--- Family <= 2.50 | | | |--- weights: [463.00, 0.00] class: 0 | | |--- Family > 2.50 | | | |--- weights: [0.00, 59.00] class: 1 | |--- Education > 1.50 | | |--- Income <= 116.50 | | | |--- Experience <= 32.00 | | | | |--- CCAvg <= 2.80 | | | | | |--- Experience <= 18.00 | | | | | | |--- Online <= 0.50 | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | | |--- Online > 0.50 | | | | | | | |--- weights: [6.00, 0.00] class: 0 | | | | | |--- Experience > 18.00 | | | | | | |--- ID <= 2052.00 | | | | | | | |--- weights: [1.00, 0.00] class: 0 | | | | | | |--- ID > 2052.00 | | | | | | | |--- weights: [0.00, 4.00] class: 1 | | | | |--- CCAvg > 2.80 | | | | | |--- weights: [0.00, 6.00] class: 1 | | | |--- Experience > 32.00 | | | | |--- weights: [6.00, 0.00] class: 0 | | |--- Income > 116.50 | | | |--- weights: [0.00, 237.00] class: 1
# Gini importance of features in the tree building
print(
pd.DataFrame(
best_estimator.feature_importances_, columns=["Imp"], index=X_train.columns
).sort_values(by="Imp", ascending=False)
)
Imp Education 0.373863 Income 0.296189 Family 0.182161 CCAvg 0.049369 ID 0.021313 Experience 0.020157 CD_Account 0.014389 Age 0.009617 ZIPCode 0.008576 Online 0.008211 County_Riverside County 0.004878 County_Santa Barbara County 0.002525 County_Ventura County 0.002010 Mortgage_Yes 0.001996 County_Other County 0.001917 CreditCard 0.001597 County_Sacramento County 0.001231 County_Santa Cruz County 0.000000 County_Santa Clara County 0.000000 County_San Mateo County 0.000000 County_San Francisco County 0.000000 County_San Diego County 0.000000 County_San Bernardino County 0.000000 County_Kern County 0.000000 County_Orange County 0.000000 County_Monterey County 0.000000 County_Marin County 0.000000 County_Los Angeles County 0.000000 County_Contra Costa County 0.000000 County_Alameda County 0.000000 Mortgage_No 0.000000 Securities_Account 0.000000 County_Yolo County 0.000000
Observations: In tuned decision tree, Education and Income are the most important features, followed by Family size.
import matplotlib.pyplot as plt
from sklearn import tree
# Create a figure
fig = plt.subplots(figsize=(12, 8))
# Plot the decision tree
out = tree.plot_tree(
best_estimator,
feature_names=feature_names,
filled=True,
fontsize=9,
node_ids=False,
class_names=None,
)
pd.DataFrame(
best_estimator.feature_importances_, columns=["Imp"], index=X_train.columns
).sort_values(by="Imp", ascending=False)
| Imp | |
|---|---|
| Education | 0.373863 |
| Income | 0.296189 |
| Family | 0.182161 |
| CCAvg | 0.049369 |
| ID | 0.021313 |
| Experience | 0.020157 |
| CD_Account | 0.014389 |
| Age | 0.009617 |
| ZIPCode | 0.008576 |
| Online | 0.008211 |
| County_Riverside County | 0.004878 |
| County_Santa Barbara County | 0.002525 |
| County_Ventura County | 0.002010 |
| Mortgage_Yes | 0.001996 |
| County_Other County | 0.001917 |
| CreditCard | 0.001597 |
| County_Sacramento County | 0.001231 |
| County_Santa Cruz County | 0.000000 |
| County_Santa Clara County | 0.000000 |
| County_San Mateo County | 0.000000 |
| County_San Francisco County | 0.000000 |
| County_San Diego County | 0.000000 |
| County_San Bernardino County | 0.000000 |
| County_Kern County | 0.000000 |
| County_Orange County | 0.000000 |
| County_Monterey County | 0.000000 |
| County_Marin County | 0.000000 |
| County_Los Angeles County | 0.000000 |
| County_Contra Costa County | 0.000000 |
| County_Alameda County | 0.000000 |
| Mortgage_No | 0.000000 |
| Securities_Account | 0.000000 |
| County_Yolo County | 0.000000 |
The best estimator feature importances value for all features shows in the data is 0.
!pip install scikit-learn matplotlib
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
# Assuming you have X_train, y_train from previous steps
# Create a decision tree classifier
clf = DecisionTreeClassifier(random_state=42)
# Get the pruning path
path = clf.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas, impurities = path.ccp_alphas, path.impurities
# Plot total impurity vs effective alpha
plt.figure(figsize=(10, 6))
plt.plot(ccp_alphas[:-1], impurities[:-1], marker='o', drawstyle="steps-post")
plt.xlabel("Effective Alpha")
plt.ylabel("Total Impurity of Leaves")
plt.title("Total Impurity vs Effective Alpha for Training Set")
plt.show()
Requirement already satisfied: scikit-learn in /usr/local/lib/python3.10/dist-packages (1.5.2) Requirement already satisfied: matplotlib in /usr/local/lib/python3.10/dist-packages (3.7.1) Requirement already satisfied: numpy>=1.19.5 in /usr/local/lib/python3.10/dist-packages (from scikit-learn) (1.26.4) Requirement already satisfied: scipy>=1.6.0 in /usr/local/lib/python3.10/dist-packages (from scikit-learn) (1.13.1) Requirement already satisfied: joblib>=1.2.0 in /usr/local/lib/python3.10/dist-packages (from scikit-learn) (1.4.2) Requirement already satisfied: threadpoolctl>=3.1.0 in /usr/local/lib/python3.10/dist-packages (from scikit-learn) (3.5.0) Requirement already satisfied: contourpy>=1.0.1 in /usr/local/lib/python3.10/dist-packages (from matplotlib) (1.3.0) Requirement already satisfied: cycler>=0.10 in /usr/local/lib/python3.10/dist-packages (from matplotlib) (0.12.1) Requirement already satisfied: fonttools>=4.22.0 in /usr/local/lib/python3.10/dist-packages (from matplotlib) (4.53.1) Requirement already satisfied: kiwisolver>=1.0.1 in /usr/local/lib/python3.10/dist-packages (from matplotlib) (1.4.7) Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.10/dist-packages (from matplotlib) (24.1) Requirement already satisfied: pillow>=6.2.0 in /usr/local/lib/python3.10/dist-packages (from matplotlib) (10.4.0) Requirement already satisfied: pyparsing>=2.3.1 in /usr/local/lib/python3.10/dist-packages (from matplotlib) (3.1.4) Requirement already satisfied: python-dateutil>=2.7 in /usr/local/lib/python3.10/dist-packages (from matplotlib) (2.8.2) Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.10/dist-packages (from python-dateutil>=2.7->matplotlib) (1.16.0)
We need to train the decision tree using the effective alphas. The last value in the alphas is the alpha value that prunes the whole tree, leaving the tree clfs[-1], with one node
# Training the decision tree using the effective alphas
clfs = []
for ccp_alpha in ccp_alphas:
clf = DecisionTreeClassifier(
random_state=1, ccp_alpha=ccp_alpha, class_weight={0: 0.15, 1: 0.85}
)
clf.fit(X_train, y_train)
clfs.append(clf)
print(
"Number of nodes in the last tree is : {} with ccp_alpha : {}".format(
clfs[-1].tree_.node_count, ccp_alphas[-1]
)
)
Number of nodes in the last tree is : 7 with ccp_alpha : 0.05214930148943829
clfs = clfs[:-1]
ccp_alphas = ccp_alphas[ :-1]
recall_train = []
for clf in clfs:
pred_train = clf.predict(X_train)
values_train = recall_score(y_train, pred_train)
recall_train.append(values_train)
recall_test = []
for clf in clfs:
pred_test = clf.predict(X_test)
values_test = recall_score(y_test, pred_test)
recall_test.append(values_test)
train_scores = [clf.score(X_train, y_train) for clf in clfs]
test_scores = [clf.score(X_test, y_test) for clf in clfs]
fig, ax = plt.subplots(figsize=(15, 5))
ax.set_xlabel("alpha")
ax.set_ylabel("Recall")
ax.set_title("Recall vs alpha for training and testing sets")
ax.plot(ccp_alphas, recall_train, marker="o", label="train", drawstyle="steps-post")
ax.plot(ccp_alphas, recall_test, marker="o", label="test", drawstyle="steps-post")
ax.legend()
plt.show()
The maximum value of Recall is around 0.01 alpha, for both train and test sets
# Creating the model where we get highest train and test recall
index_best_model = np.argmax(recall_test)
best_model = clfs[index_best_model]
print(best_model)
DecisionTreeClassifier(ccp_alpha=0.005758202713680605,
class_weight={0: 0.15, 1: 0.85}, random_state=1)
best_model.fit(X_train, y_train)
DecisionTreeClassifier(ccp_alpha=0.005758202713680605,
class_weight={0: 0.15, 1: 0.85}, random_state=1)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook. On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
DecisionTreeClassifier(ccp_alpha=0.005758202713680605,
class_weight={0: 0.15, 1: 0.85}, random_state=1)Model performance on training set¶
decision_tree_postpruned_perf_train = model_performance_classification(
best_model, X_train, y_train
)
decision_tree_postpruned_perf_train
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.93525 | 0.992 | 0.592357 | 0.741775 |
confusion_matrix_sklearn(best_model, X_train, y_train)
Model performance on test set¶
decision_tree_postpruned_perf_test = model_performance_classification(
best_model, X_test, y_test
)
decision_tree_postpruned_perf_test
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.941 | 0.980952 | 0.64375 | 0.777358 |
confusion_matrix_sklearn(best_model, X_test, y_test)
Observations
- The results have improved from the initial model
- The model is giving a generalized performance on training and test set
#Visualizing the decision tree
plt.figure(figsize=(10, 8))
out = tree.plot_tree(
best_model,
feature_names=feature_names,
filled=True,
fontsize=9,
node_ids=True,
class_names=True,
)
for o in out:
arrow = o.arrow_patch
if arrow is not None:
arrow.set_edgecolor("black")
arrow.set_linewidth(1)
plt.show()
plt.show()
# Text report showing the rules of a decision tree
print(tree.export_text(best_model, feature_names=feature_names, show_weights=True))
|--- Income <= 98.50 | |--- CCAvg <= 2.95 | | |--- weights: [424.80, 0.00] class: 0 | |--- CCAvg > 2.95 | | |--- weights: [21.15, 25.50] class: 1 |--- Income > 98.50 | |--- Education <= 1.50 | | |--- Family <= 2.50 | | | |--- weights: [80.55, 2.55] class: 0 | | |--- Family > 2.50 | | | |--- weights: [2.85, 54.40] class: 1 | |--- Education > 1.50 | | |--- Income <= 114.50 | | | |--- weights: [13.65, 28.05] class: 1 | | |--- Income > 114.50 | | | |--- weights: [0.75, 208.25] class: 1
# Gini importance of features in the tree building
print(
pd.DataFrame(
best_model.feature_importances_, columns=["Imp"], index=X_train.columns
).sort_values(by="Imp", ascending=False)
)
Imp Income 0.614717 Family 0.164455 Education 0.148762 CCAvg 0.072066 ID 0.000000 County_San Francisco County 0.000000 County_Other County 0.000000 County_Riverside County 0.000000 County_Sacramento County 0.000000 County_San Bernardino County 0.000000 County_San Diego County 0.000000 County_Santa Barbara County 0.000000 County_San Mateo County 0.000000 County_Monterey County 0.000000 County_Santa Clara County 0.000000 County_Santa Cruz County 0.000000 County_Ventura County 0.000000 County_Orange County 0.000000 County_Kern County 0.000000 County_Marin County 0.000000 County_Los Angeles County 0.000000 Age 0.000000 County_Contra Costa County 0.000000 County_Alameda County 0.000000 Mortgage_Yes 0.000000 Mortgage_No 0.000000 CreditCard 0.000000 Online 0.000000 CD_Account 0.000000 Securities_Account 0.000000 ZIPCode 0.000000 Experience 0.000000 County_Yolo County 0.000000
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn import tree
# Assuming 'best_model', 'X_train', and 'feature_names' are already defined
importances = best_model.feature_importances_
indices = np.argsort(importances)
# Instead of calling the undefined 'plot_feature_importances',
# we directly create the plot using matplotlib:
plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
Observations
Income is the most important feature to predict if the customer will borrow a personal loan Followed by Family size, Education and the CCAvg.
# Training performance comparison
models_train_comp_df = pd.concat(
[
decision_tree_perf_train.T,
decision_tree_tune_perf_train.T,
decision_tree_postpruned_perf_train.T,
],
axis=1,
)
models_train_comp_df.columns = [
"Decision Tree Sklearn",
"Decision Tree (Pre-Pruning)",
"Decision Tree (Post-Pruning)",
]
print("Training performance comparison : ")
models_train_comp_df
Training performance comparison :
| Decision Tree Sklearn | Decision Tree (Pre-Pruning) | Decision Tree (Post-Pruning) | |
|---|---|---|---|
| Accuracy | 1.0 | 1.0 | 0.935250 |
| Recall | 1.0 | 1.0 | 0.992000 |
| Precision | 1.0 | 1.0 | 0.592357 |
| F1 | 1.0 | 1.0 | 0.741775 |
Model Performance Comparison and Final Model Selection¶
# Test performance comparison
models_test_comp_df = pd.concat(
[
decision_tree_perf_test.T,
decision_tree_tune_perf_test.T,
decision_tree_postpruned_perf_test.T, # Fixed: This line was split and incorrectly indented
],
axis=1,
)
models_test_comp_df.columns = [
"Decision Tree Sklearn",
"Decision Tree (Pre-Pruning)",
"Decision Tree (Post-Pruning)",
]
print("Test performance comparison : ")
models_test_comp_df
Test performance comparison :
| Decision Tree Sklearn | Decision Tree (Pre-Pruning) | Decision Tree (Post-Pruning) | |
|---|---|---|---|
| Accuracy | 0.982000 | 0.986000 | 0.941000 |
| Recall | 0.877358 | 0.904762 | 0.980952 |
| Precision | 0.948980 | 0.959596 | 0.643750 |
| F1 | 0.911765 | 0.931373 | 0.777358 |
Conclusion Decision tree model with pre-pruning has given the best recall score on training data. Decision tree model with post-pruning has given the best recall score on test set. The tree with post pruning is not complex and is easy to interpret.
Logistic Regression and Decision Tree¶
Model comparison¶
Logistic Regression
Outliers were treated Variables with multicollinearity and a p>0.05 were dropped in the process of model building Highest accuracy in post pruning is 0.99 and recall is 0.91. Able to interpret that a unit increase in any of the variables (Income, Family, CCAvg, CD_Account, Education levels, Riverside County) will increase the odds of a customer taking a personal loan.
Decision Tree
Variables with outliers and multicollinearity were excluded as they did not impact the Decision Tree. The highest recall achieved on the test set was 0.99 using a post-pruned decision tree, yielding consistent results across train and test sets. Notably, Income, Family, Education, and CCAvg emerged as key predictors of a customer's likelihood to borrow a personal loan.
Actionable Insights and Business Recommendations¶
Recommendation¶
The marketing team of the bank should target customers with the following:
- The bank's marketing efforts should concentrate on high-value client characterized by annual incomes over 120,000 graduate or advanced/professional education,family siced of three or more,existing certificate of deposit accounts,and substntial credit cards pending(above $2500/month).
- Additional targeting criteria include mortgage holders, individuals aged 35+, those with 9+ years of experience, and residents of LA, Santa Clara, San Diego, and Riverside Counties for enhanced effectiveness.